Should I use UPARSE or UNOISE?
The cluster_otus command performs
97% OTU clustering using the UPARSE-OTU algorithm.
For most purposes, I consider 97% OTU clustering
obsolete. It is better to use the unoise command
to recover the full set of biological sequences in the reads. These are also
valid OTUs; I call them "ZOTUs" for zero-radius OTUs, to emphasize this. See
the UNOISE paper for full discussion.
Input to cluster_otus is a
FASTA file containing quality filtered,
globally trimmed and
dereplicated reads from a marker gene amplicon
sequencing experiment, e.g. 16S or ITS. It is generally recommended that
singleton reads should be discarded. See
UPARSE pipeline for discussion of how to prepare reads before clustering.
Reads must be
globally trimmed before finding unique sequences.
Input sequence labels must have size annotations
giving the abundance of the unique sequence. Size annotations are generated by
the -sizeout option of clustering commands; typically
fastx_uniques is used.
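As a rough illustration of what a size annotation looks like to a downstream script, here is a minimal Python sketch that reads the abundance from a label of the form `Uniq1;size=42;`. The function name `parse_size` and the choice to treat an unannotated label as abundance 1 are my assumptions for this example, not part of usearch.

```python
import re

# Matches a size annotation such as ";size=42;" or a trailing ";size=42".
SIZE_RE = re.compile(r";size=(\d+)")

def parse_size(label, default=1):
    """Return the abundance encoded in a sequence label.

    Assumption for this sketch: a label without a size annotation
    is treated as having abundance `default`.
    """
    match = SIZE_RE.search(label)
    return int(match.group(1)) if match else default

if __name__ == "__main__":
    for label in ["Uniq1;size=42;", "Uniq2;size=7", "Uniq3"]:
        print(label, parse_size(label))
```

A script like this is only useful for checking your input; usearch itself reads the annotations directly.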
The -sizein and -sizeout options are no longer
supported by cluster_otus because they were misleading for evaluating the
results. To determine the number of reads in each OTU, it is better to
make an OTU table using reads before
quality filtering and deleting singletons, which recovers many (usually,
most) of the reads that were discarded. Using -sizein and -sizeout can give
the impression that UPARSE discards a large fraction of the reads, which is
usually not the case if you use my recommended approach.
The -minsize option can be used to specify a minimum abundance; for
example you can use -minsize 2 to discard singletons.
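The effect of a minimum abundance threshold can be sketched in a few lines of Python. This is an illustration of the idea only, not how usearch implements it; `filter_min_size` is a hypothetical helper, and treating a label without a size annotation as abundance 1 is an assumption of the sketch.

```python
import re

def filter_min_size(records, min_size=2):
    """Keep (label, sequence) pairs whose ;size=N annotation is >= min_size.

    Assumption for this sketch: a label with no size annotation counts
    as abundance 1, so it is dropped when min_size is 2 or more.
    """
    kept = []
    for label, seq in records:
        match = re.search(r";size=(\d+)", label)
        size = int(match.group(1)) if match else 1
        if size >= min_size:
            kept.append((label, seq))
    return kept

if __name__ == "__main__":
    records = [
        ("Uniq1;size=5;", "ACGT"),
        ("Uniq2;size=1;", "ACGA"),  # singleton, discarded with min_size=2
    ]
    print(filter_min_size(records, min_size=2))
```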
The -otu_radius_pct option specifies the OTU "radius"
as a percentage, i.e. the maximum difference between an OTU member
sequence and the representative sequence of that OTU. Default is 3.0,
corresponding to a minimum identity of 97%. It
is not recommended to use a
non-default value; see
UPARSE OTU radius for discussion and a
solution for making OTUs at different identities.
The -otus option specifies a FASTA output file for the
OTU representative sequences. By default, OTU labels are taken from the input
file, with size annotations stripped. The -relabel
option specifies a string that is used to re-label OTUs. If -relabel xxx is
specified, then the labels are xxx followed by 1, 2 ... up to the number of
OTUs. OTU identifiers in the labels are required for
making an OTU table using usearch_global.
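The numbering scheme used for re-labeling is simple enough to show in a short Python sketch: a prefix followed by 1, 2, and so on. The function `relabel` is a hypothetical helper written for this example and is not part of usearch.

```python
def relabel(count, prefix="OTU"):
    """Return labels prefix1, prefix2, ... up to prefix<count>."""
    return [f"{prefix}{i}" for i in range(1, count + 1)]

if __name__ == "__main__":
    # With a prefix of "OTU" and three OTUs, the labels are OTU1, OTU2, OTU3.
    print(relabel(3, prefix="OTU"))
```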
The -uparseout option
specifies a tabbed text output file documenting how the input sequences were
The -uparsealnout option specifies a text file
containing a human-readable alignment of each query sequence to its
Parsimony score options, alignment parameters and
heuristics are supported.
usearch -cluster_otus derep.fa -otus otus.fa -uparseout out.up -relabel OTU -minsize 2
usearch -usearch_global all_reads.fa -db otus.fa -strand plus -id 0.97 -otutabout otu_table.txt