UPARSE pipeline

See also
  UPARSE home page
  OTU benchmark results
  UPARSE algorithm
  Example UPARSE command lines

A UPARSE pipeline clusters NGS amplicon reads into OTUs using the cluster_otus command. This page discusses the pre- and post-processing steps that are typically required to get the best results from cluster_otus in practice.

It is not possible to give a single set of command lines that will work for all reads because there are many variations in the input data, especially in read layouts. This page summarizes the steps that are usually performed by a pipeline and provides links to further discussion and details of the command lines that can be used. The example UPARSE command lines are a good starting point.

Reads with Phred (quality) scores
I strongly recommend starting from "raw" reads, i.e. the reads originally provided by the sequencing machine base-calling software. Phred scores should be retained, and you should do quality filtering with USEARCH rather than using reads that have already been filtered by third-party software. Start by converting to FASTQ format, if needed. If you have 454 reads in FASTA + QUAL format, you can use the faqual2fastq.py script to convert.
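For example, assuming reads.fa and reads.qual are your read and quality files (placeholder names) and that the script writes FASTQ to standard output:

  python faqual2fastq.py reads.fa reads.qual > reads.fastq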

Sample pooling
I recommend combining reads from as many samples as possible if they contain similar communities and / or if you are planning to compare samples using measures such as beta diversity. See sample pooling for discussion.

Read quality filtering
Quality filtering of the reads should be done using USEARCH because I believe that the maximum expected error filtering method is much better than most other filters, e.g. those based on average Q scores. I recommend choosing quality filtering parameters manually for each run based on Phred score statistics.
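As a sketch (file names and thresholds are placeholders that should be tuned per run from the Phred score statistics), expected error filtering with truncation to a fixed length could look like this:

  usearch -fastq_filter reads.fastq -fastq_trunclen 250 -fastq_maxee 1.0 -fastaout filtered.fasta

Here -fastq_trunclen 250 truncates reads to 250 bases and -fastq_maxee 1.0 discards reads with more than one expected error after truncation.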

Demultiplexed Illumina reads
See here for discussion of adding sample labels to demultiplexed Illumina reads, i.e. reads that are already split into separate FASTQ files by barcode/sample identifier.

Flowgram denoising
If you have 454 reads, then as an alternative to quality filtering with USEARCH you can generate FASTA format reads by denoising flowgrams using a third-party algorithm [Pubmed:20805793, Pubmed:19668203]. This may give a small improvement in OTU sequence accuracy compared to Phred score quality filtering, but denoising can be very computationally intensive and I generally don't consider it worth the effort. If you do choose to use denoising, then you should convert the output from the denoising program so that size annotations are added to the labels in USEARCH format, remove barcodes (adding sample identifiers to the read labels), and then skip ahead to the abundance sort step below. If you use a denoising package (e.g. AmpliconNoise) that includes a chimera filter (Perseus in the case of AN), then you should turn off the chimera filter, i.e., extract denoised reads before any chimera filtering step.

FASTA reads
See "Flowgram denoising" above if the FASTA reads were produced by a denoising program. If "raw" reads or reads that have been quality-filtered by a third-party program are only available in FASTA format, then you should start by trimming them to a fixed length, unless the reads contain full-length amplicons, in which case this step may not be necessary. See global trimming for discussion. Since quality information is not available, you cannot choose the trim length based on predicted error rates. Instead, you could choose a value that is, say, a few percent longer than the average length in order to maximize the number of bases retained. However, you should be cautious here because quality tends to get worse towards the end of a read. For example, if you have 454 reads that are, say, 400 bases or longer, then it might be better to truncate to a shorter length, e.g. 250 or 300 bases as this could substantially reduce the error rate.

Length trimming
Trimming to a fixed position is critically important for achieving the best results. For unpaired reads, trim to a fixed length. For overlapping paired reads, the reverse read should start at an amplification primer, which achieves an equivalent result. The important point is that identical or very similar reads must be globally alignable with no terminal gaps. See global trimming for discussion.
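For FASTQ reads, fixed-length truncation is usually done as part of quality filtering with the -fastq_trunclen option shown above. For FASTA-only reads, newer versions of USEARCH include a fastx_truncate command; a sketch, assuming your version supports it (otherwise use a third-party trimmer):

  usearch -fastx_truncate reads.fasta -trunclen 250 -fastaout trimmed.fasta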

Paired reads
Paired reads should be merged using the fastq_mergepairs command before quality filtering. The only quality filtering that should be done at this stage is truncating reads at the first low score using the -fastq_truncqual option to fastq_mergepairs; otherwise you may find that many pairs do not align to each other because of poor-quality tails in the reads. After merging, you can use fastq_filter with a maximum expected error threshold. Length truncation is typically not needed since the merged pairs usually cover full-length amplicons (see global trimming for discussion).
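A minimal sketch (file names are placeholders; the truncation threshold is an example value):

  usearch -fastq_mergepairs reads_R1.fastq -reverse reads_R2.fastq -fastq_truncqual 3 -fastqout merged.fastq

The merged reads in merged.fastq can then be quality filtered with fastq_filter and -fastq_maxee as described above.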

Barcodes
Barcodes and any other non-biological sequence must be stripped from the reads before dereplication. This can be done using the fastq_strip_barcode_relabel.py script or any other convenient method. The barcode must be removed before dereplication so that reads with identical biological sequence but different barcodes are recognized as duplicates. The barcode sequence or a sample identifier should be embedded in the read label so that, when reads are later mapped to OTUs, each read can be assigned to its sample in the OTU table. It is recommended to strip barcodes and other non-biological sequence before quality filtering.
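As a sketch, assuming the script takes the FASTQ file name, the forward primer sequence, a FASTA file of barcode sequences, and a sample label prefix as arguments and writes the relabeled reads to standard output (check the script documentation; the primer shown is only an example):

  python fastq_strip_barcode_relabel.py reads.fastq AGAGTTTGATCMTGGCTCAG barcodes.fa Sample > relabeled.fastq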

Dereplication
Input to dereplication is a set of reads in FASTA format with non-biological sequences such as barcodes stripped. The reads should be globally trimmed before dereplication, and quality filtered if possible as described above. You should use the derep_fulllength command with the -sizeout option. I recommend pooling samples before dereplication.
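A sketch (file names are placeholders; the output option is -output in v7 and -fastaout in v8, so adjust for your version):

  usearch -derep_fulllength filtered.fasta -sizeout -output derep.fasta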

Abundance sort
Use the sortbysize command to sort the dereplicated reads by decreasing abundance. To discard singletons (usually recommended), use the -minsize 2 option.
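For example (again, the output option is -output in v7 and -fastaout in v8):

  usearch -sortbysize derep.fasta -minsize 2 -output sorted.fasta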

OTU clustering
To create OTUs, run the cluster_otus command with abundance-sorted reads as input. This will generate a set of OTU representative sequences.
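A minimal sketch:

  usearch -cluster_otus sorted.fasta -otus otus.fasta

The -uparseout option can be added to write a tabbed text file describing how each input sequence was classified, which can be useful for troubleshooting.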

Chimera filtering
The cluster_otus command discards reads that have chimeric models built from more abundant reads. However, a few chimeras may be missed, especially if they have parents that are absent from the reads or are present with very low abundance. It is therefore recommended to add a reference-based chimera filtering step using UCHIME if a suitable database is available. Use the uchime_ref command for this step with the OTU representative sequences as input and the -nonchimeras option to get a chimera-filtered set of OTU sequences. For the 16S gene, I recommend the gold database (do not use a large 16S database like Greengenes). For the ITS region, you could try using the UNITE database as a reference.
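A sketch, assuming the gold database is in gold.fa (file names are placeholders):

  usearch -uchime_ref otus.fasta -db gold.fa -strand plus -nonchimeras otus_nonchimera.fasta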

Labeling OTUs
At this stage the OTU sequence labels are usually the original read label with a size annotation appended. Note that this is the size of the dereplication cluster, i.e. the number of reads having this unique sequence, not the number of reads assigned to the OTU (see Creating an OTU table below). It is therefore useful to generate a new set of labels for the OTUs, e.g. OTU_1, OTU_2 ... OTU_N where N is the number of OTUs. This can be done using the fasta_number.py script.
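As a sketch, assuming the script takes the FASTA file and a label prefix as arguments and writes the relabeled sequences to standard output:

  python fasta_number.py otus_nonchimera.fasta OTU_ > otus_final.fasta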

Creating an OTU table
To create an OTU table, you should first map reads to OTUs. Then you can use the uc2otutab.py script to generate the OTU table.
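One common way to do the mapping (a sketch; file names are placeholders and 97% is the usual OTU identity threshold) is to search the reads against the final OTU sequences with usearch_global and write the hits to a .uc file, which uc2otutab.py can read. The script determines the sample for each read from the sample identifier embedded in the read label, as added at the barcode step:

  usearch -usearch_global reads.fasta -db otus_final.fasta -strand plus -id 0.97 -uc map.uc
  python uc2otutab.py map.uc > otu_table.txt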

Taxonomy assignment
You can use the utax command (requires USEARCH version 8) to assign taxonomy to the OTU representative sequences.
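The utax options changed between v8.0 and v8.1, so check the utax documentation for your version; with v8.1 and a UTAX reference database the command might look something like this (database and file names are placeholders):

  usearch -utax otus_final.fasta -db utax_ref.udb -strand both -utaxout otu_tax.txt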