uparse_ref command

Annotation of amplicon sequences using the UPARSE-REF algorithm. This command is designed for use in validating mock community sequencing experiments.

A database file of nucleotide sequences must be specified using the -db option. The database may be in FASTA or UDB format. The reference database should include all biological sequences that are expected to appear in the input set. It should be as complete and correct as possible, and no larger than necessary. Do not use a large reference database such as Greengenes, SILVA or the gold reference database for UCHIME.

The main use for uparse_ref is to annotate reads, OTUs and other sequences generated from mock community experiments where the biological sequences in the sample are known.

The uparse_ref command does not perform well on benchmarks developed to validate ChimeraSlayer and UCHIME. It is not designed for use as a general-purpose chimera detection or chimera filtering method.

The -strand option is required and must be specified as -strand plus. This means that the database must be oriented on the same strand as the query sequences (or contain both forward and reverse-complemented reference sequences).
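If the orientation of the queries is not known, one way to satisfy this requirement is to build a reference file that holds each sequence in both orientations. The following is a minimal sketch, not part of USEARCH: the function name both_strands_fasta and the "_rc" label suffix are assumptions made for illustration, and any character other than A, C, G or T is written as N in the reverse-complemented copy.

```shell
# Sketch only: the function name and the "_rc" label suffix are assumptions,
# not USEARCH conventions.
both_strands_fasta() {
  awk '
    # Print the stored record, then its reverse complement.
    function write_both(   i, rc, c) {
      if (label == "") return
      rc = ""
      for (i = length(seq); i >= 1; i--) {
        c = toupper(substr(seq, i, 1))
        rc = rc ((c in comp) ? comp[c] : "N")   # non-ACGT letters become N
      }
      print label; print seq
      print label "_rc"; print rc
    }
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
    /^>/ { write_both(); label = $0; seq = ""; next }   # a new record starts
    { seq = seq $0 }                                    # join wrapped sequence lines
    END { write_both() }
  ' "$1"
}
```

For example, both_strands_fasta mock_ref.fasta > mock_ref_both.fasta would write a file that could then be passed to -db. Both file names are placeholders.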

The -uparseout option specifies a tabbed text output file documenting how the input sequences were classified.

The -fastaout option specifies a FASTA output file containing all input sequences with labels annotated according to their UPARSE-REF models. Generally, the -uparseout file is recommended because it is easier to understand and parse, but the -fastaout file provides more information.

The -uparsealnout option specifies a text file containing a human-readable alignment of each query sequence to its UPARSE-REF model.

Parsimony score options are supported.

Alignment parameters and heuristics are supported.

Multithreading is supported.

Example

usearch -uparse_ref otus.fasta -db mock_ref.udb -strand plus -uparseout out.up
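
To write all three output files in one run, the options described above can be combined. The wrapper below is a sketch: the function name run_uparse_ref, its argument order and the output file suffixes are assumptions chosen for illustration. Only the usearch options themselves come from this page.

```shell
# Sketch only: function name, argument order and file suffixes are assumptions.
run_uparse_ref() {
  query=$1    # query sequences, e.g. otus.fasta
  db=$2       # small mock community reference, FASTA or UDB
  prefix=$3   # prefix for the three output files
  usearch -uparse_ref "$query" -db "$db" -strand plus \
    -uparseout "$prefix.up" \
    -fastaout "$prefix.annotated.fa" \
    -uparsealnout "$prefix.aln"
}
```

Calling run_uparse_ref otus.fasta mock_ref.udb out would produce out.up, out.annotated.fa and out.aln.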