uchime_ref command

Chimera detection using the UCHIME algorithm. See UCHIME score for parameters.

A database file of nucleotide sequences must be specified using the -db option. The database may be in FASTA or UDB format. UDB format is faster to load. The reference database should include sequences that might appear as parents in the query set. These should be high-quality sequences that are believed to be free of chimeras. Errors in reference sequences may increase the number of both false positives and false negatives. Chimeras will not be detected if their parents (or sufficiently close relatives) are not present in the database.

The uchime_ref_minpctid option specifies a minimum identity for classification. Either the top hit to the database or a chimeric model constructed from two segments in the database must match the query sequence with at least this identity, otherwise the query is considered to be unclassified. Default 95.0 (version 8.1.1811 or later).
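The threshold rule above can be sketched in a few lines of Python. This is only an illustration of the decision described in the text, not the UCHIME scoring itself; the function name and its arguments are hypothetical, and the two identities are assumed to have been computed elsewhere.

```python
from typing import Optional


def classification_status(top_hit_id: float,
                          model_id: Optional[float],
                          min_pct_id: float = 95.0) -> str:
    """Illustrate the minimum-identity rule (hypothetical helper).

    top_hit_id : percent identity of the query to its top database hit.
    model_id   : percent identity of the query to a two-segment chimeric
                 model, or None if no model was constructed.
    min_pct_id : the threshold, mirroring the documented default of 95.0.
    """
    # Either the top hit or the chimeric model may satisfy the threshold.
    best = top_hit_id if model_id is None else max(top_hit_id, model_id)
    return "classifiable" if best >= min_pct_id else "unclassified"
```

For example, a query whose top hit is 93% identical but whose chimeric model is 97.5% identical passes the threshold, while a query with 90% and 94.9% does not.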

Note that in an OTU clustering pipeline, the cluster_otus command includes chimera filtering that is usually significantly more sensitive than uchime_ref. Because false positives from uchime_ref may cause valid OTUs to be discarded when it is used as a pre-processing step, it is suggested that uchime_ref be used only as a post-processing step to detect chimeric OTUs. However, optimal parameters and the best place to use uchime_ref in an OTU pipeline have not been studied, so this is an open issue.

16S reference databases
Using a large database such as SILVA or Greengenes as the reference is not recommended, because these contain many low-quality sequences that degrade detection accuracy. Suggested databases are given in the table below. Files are in FASTA format with two copies of each reference sequence, one on each strand, because uchime_ref searches only the plus strand (-strand both is not supported).

Download link		Description
CS Gold		ChimeraSlayer reference database from the Broad Microbiome Utilities version microbiomeutil-r20110519. Contains 5,181 reference sequences. This has not been updated in several years, so the RDP database (below) is recommended as an alternative.
RDP Gold		RDP classifier training database (v9). Contains 10,049 reference sequences.
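The two-strand layout described above (each reference sequence followed by a copy on the opposite strand) could be reproduced for a custom reference set with a short script. The following is a minimal sketch, not the procedure used to build the files in the table; the helper names and the `_rc` label suffix are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple

# Complement table for nucleotide codes, including IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (label, sequence) pairs from FASTA-formatted lines."""
    label, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def both_strands(records: Iterable[Tuple[str, str]],
                 suffix: str = "_rc") -> Iterator[Tuple[str, str]]:
    """Yield each record followed by its reverse complement.

    The suffix appended to the label of the second copy is hypothetical.
    """
    for label, seq in records:
        yield label, seq
        yield label + suffix, reverse_complement(seq)


def to_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Format (label, sequence) pairs as FASTA text, one line per sequence."""
    return "".join(f">{label}\n{seq}\n" for label, seq in records)
```

A typical use would be to open a plus-strand FASTA file, pass the file object to `read_fasta`, and write `to_fasta(both_strands(...))` to a new file that is then given to the -db option.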

The -strand option is required. Currently this must be specified as -strand plus.

Multithreading is supported.

See uchime output files for output options.

The -self and -selfid options specify that a reference sequence matching the query sequence should be ignored. This is useful for estimating the false-positive rate using a database of sequences known to be free of chimeras. With -self, matching is done by sequence label; with -selfid, matching is done from an alignment (a 100% match is ignored).
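The difference between the two matching modes can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the function and its arguments are hypothetical, and exact (case-insensitive) string equality stands in for the alignment-based 100% identity test that the text describes.

```python
from typing import Dict


def usable_references(query_label: str,
                      query_seq: str,
                      references: Dict[str, str],
                      mode: str) -> Dict[str, str]:
    """Return the references that remain after ignoring 'self' matches.

    mode == "self"   : ignore the reference with the same label as the query.
    mode == "selfid" : ignore references identical to the query sequence
                       (assumption: string equality approximates a 100%
                       identity alignment).
    """
    if mode == "self":
        return {label: seq for label, seq in references.items()
                if label != query_label}
    if mode == "selfid":
        return {label: seq for label, seq in references.items()
                if seq.upper() != query_seq.upper()}
    raise ValueError(f"unknown mode: {mode!r}")
```

With label matching, a duplicate of the query stored under a different label remains available as a hit, whereas identity matching removes every identical copy regardless of its label.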

Example

usearch -uchime_ref reads.fasta -db 16s_ref.udb -uchimeout results.uchime -strand plus