A database file of nucleotide sequences must be specified using the -db option. The database may be in FASTA or UDB format. UDB format is faster to load. The reference database should include sequences that might appear as parents in the query set.
Don't use UCHIME(2) for OTU clustering or denoising!
I do not recommend using uchime2_ref or uchime2_denovo in an OTU clustering pipeline because of the risk of false positives. The cluster_otus and unoise commands have built-in de novo chimera filtering which works very well for most data.
In most cases I strongly recommend using the largest possible database, e.g. SILVA for 16S or UNITE for ITS. The advice to use a small, high-quality database in the first UCHIME paper and in previous versions of the USEARCH manual was wrong!
The following output files are supported:
-uchimeout out.txt (tabbed text)
-chimeras ch.fa (FASTA file with predicted chimeras)
-notmatched not.fa (FASTA file with sequences not matched to the database)
-uchimealnout aln.txt (alignments)
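As a minimal sketch, the output options listed above can be requested together in one run. This assumes the options can be combined freely; the file names are placeholders, and -mode sensitive is chosen only for illustration. The command is printed for review and executed only if usearch and both input files are present.

```shell
# Placeholder file names; substitute your own query and reference files.
query=reads.fasta
db=16s_ref.udb

# Compose a single run that writes all four output files.
cmd="usearch -uchime2_ref $query -db $db -strand plus -mode sensitive"
cmd="$cmd -uchimeout out.txt -chimeras ch.fa -notmatched not.fa -uchimealnout aln.txt"

# Print the command for review. It is run only when usearch and the inputs exist.
# Note: $cmd is expanded unquoted, so file names must not contain spaces.
echo "$cmd"
if command -v usearch >/dev/null 2>&1 && [ -f "$query" ] && [ -f "$db" ]; then
    $cmd
fi
```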
The -nonchimeras option is no longer supported. This is because it is not possible to determine that a sequence is non-chimeric; the best we can say is that it is found or not found in the database (the reasons are explained in the UCHIME2 paper). The -notmatched output is the equivalent for uchime2_ref, but as with uchime_ref you should not interpret the output as containing non-chimeric sequences!
The -mode option is required and must be one of:
high_confidence: Report chimera predictions with high confidence, at the expense of a high false-negative rate.
specific: Similar to high_confidence mode, but less stringent, so the false-negative rate is lower but the false-positive rate may be higher. Gives results similar to the old UCHIME algorithm.
balanced: Attempts to balance false negatives and false positives to minimize the overall error rate on typical data. Of course, the rates are highly data-dependent.
sensitive: Emphasizes high sensitivity at the expense of a high false-positive rate.
denoised: Reports all perfect chimeric models. Mostly used for designing and validating algorithms -- this mode is rarely, if ever, useful in practice because the database is implicitly assumed to be complete (i.e., all parent sequences are exactly present) and the query set and database are both assumed to have no errors. A single difference will prevent the model from being reported, causing false negatives. Conversely, fake models are common, causing false positives (see UCHIME2 paper for details).
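Because the error rates of each mode are data-dependent, it can help to run the same query under several modes and compare how many sequences each one flags. The sketch below only prints one command per mode so the runs can be checked first. It assumes the mode names high_confidence, specific, balanced and sensitive; the helper uchime2_cmd and all file names are hypothetical.

```shell
# Hypothetical helper: print (not run) a uchime2_ref command for one mode.
# reads.fasta and 16s_ref.udb are placeholder file names.
uchime2_cmd() {
    echo "usearch -uchime2_ref reads.fasta -db 16s_ref.udb -strand plus -mode ${1} -uchimeout out_${1}.txt -chimeras ch_${1}.fa"
}

# One command per mode, each writing to its own output files.
for m in high_confidence specific balanced sensitive; do
    uchime2_cmd "$m"
done
```

After running the printed commands, grep -c '^>' ch_sensitive.fa (and likewise for the other modes) counts the predicted chimeras in each output.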
The -strand option is required. Currently this must be specified as -strand plus because searching on both strands is not supported.
Multithreading is supported.
The -self option specifies that a reference sequence matching the query sequence should be ignored. This is useful for estimating the false-positive rate using a database of sequences known to be free of chimeras: with -self, each sequence is tested against all the others, which amounts to a leave-one-out test. The -self option requires that the query and database are the same file.
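A leave-one-out estimate along these lines might be scripted as follows. This is a sketch under stated assumptions: ref.fa stands for a chimera-free reference set, fp.fa is a placeholder name for the -chimeras output, -mode sensitive is chosen only for illustration, and count_fasta and fp_rate are hypothetical helper functions, not part of USEARCH. Since the reference is assumed to contain no chimeras, every sequence written to fp.fa is counted as a false positive.

```shell
# Count FASTA records by counting header lines (those starting with ">").
count_fasta() {
    grep -c '^>' "$1" || true
}

# Print flagged sequences as a percentage of all sequences, or NA if there are none.
# $1 = reference FASTA used as both query and database, $2 = -chimeras output.
fp_rate() {
    total=$(count_fasta "$1")
    flagged=$(count_fasta "$2")
    awk -v f="$flagged" -v t="$total" \
        'BEGIN { if (t == 0) print "NA"; else printf "%.2f\n", 100 * f / t }'
}

# Run the leave-one-out test only when usearch and the reference file are available.
if command -v usearch >/dev/null 2>&1 && [ -f ref.fa ]; then
    usearch -uchime2_ref ref.fa -db ref.fa -self -strand plus -mode sensitive -chimeras fp.fa
    echo "Estimated false-positive rate: $(fp_rate ref.fa fp.fa)%"
fi
```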
R. C. Edgar (2016), UCHIME2: Improved chimera detection for amplicon sequences, http://dx.doi.org/10.1101/074252.
usearch -uchime2_ref reads.fasta -db 16s_ref.udb -uchimeout out.txt -strand plus -mode sensitive