Chimera detection using the UCHIME
algorithm. See UCHIME score for details.
A reference database file of nucleotide sequences must be specified using the -db
option. The database may be in FASTA or UDB format. UDB format is faster to
load. The reference database should include sequences that might appear as
parents in the query set. These should be high-quality sequences that are
believed to be free of chimeras. Errors in reference sequences may
increase the number of both false positives and false negatives. Chimeras will not be
detected if their parents (or sufficiently close relatives) are not
present in the database.
The uchime_ref_minpctid option specifies a minimum
identity for classification. Either the top hit to the database or a chimeric
model constructed from two segments in the database must match the query
sequence with at least this identity; otherwise, the query is considered to be
unclassified. The default is 95.0 (version 8.1.1811 or later).
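The identity threshold can be illustrated with a minimal sketch. The function name and its arguments are hypothetical and are not part of usearch; the sketch only shows the rule stated above, namely that a query is classifiable when either the top hit or the chimeric model reaches the minimum identity. It does not attempt to reproduce how usearch then decides between chimeric and non-chimeric.

```python
from typing import Optional


def is_classifiable(
    top_hit_pctid: Optional[float],
    model_pctid: Optional[float],
    minpctid: float = 95.0,  # default quoted above for version 8.1.1811 or later
) -> bool:
    """Return True if the query passes the minimum-identity rule.

    top_hit_pctid: percent identity of the query to its top database hit,
                   or None if there was no hit (hypothetical input).
    model_pctid:   percent identity of the query to the two-segment chimeric
                   model, or None if no model was built (hypothetical input).
    """
    candidates = [p for p in (top_hit_pctid, model_pctid) if p is not None]
    # Either identity reaching the threshold is sufficient.
    return any(p >= minpctid for p in candidates)
```

For example, a query with a 90.0% top hit and a 97.5% chimeric model would pass the threshold, while one with 90.0% and 94.9% would be left unclassified.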
Note that in an OTU clustering pipeline, the
cluster_otus command includes chimera filtering that is usually significantly
more sensitive than uchime_ref. Because false positives from uchime_ref may
cause valid OTUs to be discarded if uchime_ref is used as a pre-processing step,
it is suggested that uchime_ref be used only as a post-processing step to detect
chimeric OTUs. However, optimal parameters and the best place to use uchime_ref
in an OTU pipeline have not been studied, so this is an open issue.
16S reference databases
Using a large database such as SILVA or Greengenes as a reference database
is not recommended, because these
contain many low-quality sequences that degrade detection accuracy. Suggested
databases are listed below. Files are in FASTA format with two copies of each
reference sequence, one on each strand. This is because uchime_ref
only searches on the plus strand (-strand both is not supported).
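If you build your own reference file, the same two-strand layout can be produced with a short script. The following is a minimal sketch, not the procedure used to build the suggested databases: the function names and the "_rc" label suffix are assumptions made for illustration, and the input is assumed to be plain FASTA text using IUPAC nucleotide codes.

```python
# Complement table for IUPAC nucleotide codes, upper and lower case.
_IUPAC = "ACGTRYSWKMBDHVN"
_IUPAC_COMP = "TGCAYRSWMKVHDBN"
_COMPLEMENT = str.maketrans(
    _IUPAC + _IUPAC.lower(), _IUPAC_COMP + _IUPAC_COMP.lower()
)


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_fasta(text):
    """Yield (label, sequence) pairs from FASTA-formatted text."""
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def both_strands_fasta(text, suffix="_rc"):
    """Return FASTA text with each record followed by its reverse complement.

    The suffix appended to the reverse-complement label is a hypothetical
    naming choice; it only keeps the two labels distinct.
    """
    out = []
    for label, seq in parse_fasta(text):
        out.append(f">{label}\n{seq}\n")
        out.append(f">{label}{suffix}\n{reverse_complement(seq)}\n")
    return "".join(out)
```

Writing the result of both_strands_fasta to a file gives a database in which a plus-strand search can still match queries that lie on the opposite strand.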
ChimeraSlayer reference database from the Broad Microbiome Utilities,
version microbiomeutil-r20110519. Contains 5,181 reference sequences. This has
not been updated in several years, so the RDP database (below) is recommended
instead.
RDP training database (v9). Contains 10,049 reference sequences.
The -strand option is required. Currently this must be specified as -strand plus.
Multithreading is supported.
See uchime output
files for output options.
The -self and -selfid options specify that
a reference sequence matching the query
sequence should be ignored. This is useful for estimating the false-positive
rate using a database of sequences known to be free of chimeras. With -self,
matching is done by sequence label; with -selfid, matching is done from an
alignment (a 100% match is ignored).
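A false-positive check of this kind could be scripted by assembling the command line in a small helper. This is a sketch under stated assumptions: the helper name, its self_mode argument and the file names are hypothetical, and only options named on this page (-uchime_ref, -db, -uchimeout, -strand, -self, -selfid) are used.

```python
def uchime_ref_command(query, db, uchimeout, self_mode=None, usearch="usearch"):
    """Build the argument list for a uchime_ref run.

    self_mode: None        -> no self-match exclusion
               "label"     -> add -self   (ignore references with the same label)
               "alignment" -> add -selfid (ignore references that align at 100%)
    The self_mode values are names invented for this sketch.
    """
    cmd = [
        usearch,
        "-uchime_ref", query,
        "-db", db,
        "-uchimeout", uchimeout,
        "-strand", "plus",  # required; plus is the only supported value
    ]
    if self_mode == "label":
        cmd.append("-self")
    elif self_mode == "alignment":
        cmd.append("-selfid")
    elif self_mode is not None:
        raise ValueError(f"unknown self_mode: {self_mode!r}")
    return cmd
```

Passing the same chimera-free file as both query and database, for example uchime_ref_command("ref.fasta", "ref.fasta", "fp_check.uchime", self_mode="label"), describes a run in which any sequence reported as chimeric would count as a false positive. The list can be handed to subprocess.run(cmd, check=True) on a machine where usearch is installed.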
usearch -uchime_ref reads.fasta -db 16s_ref.udb -uchimeout results.uchime -strand plus