USEARCH manual

OTU QC: Align to a reference database

I strongly recommend aligning your OTUs to a large reference database and reviewing some selected alignments. For most samples, use the largest available reference database for your gene, e.g. SILVA for 16S or UNITE for ITS. If you have control samples (mock community or single strain), then you should align the control OTU sequences to the known sequences of the strain(s).

You can select a random subset to review by using the fastx_subsample command. I recommend reviewing the top few most abundant OTUs, and also a few of the least abundant. These alignments are especially likely to reveal problems.
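As a rough sketch, a random subset could be drawn with a small wrapper like the one below. The function name subsample_otus and the file names in the usage line are hypothetical; the -sample_size and -fastaout options are the ones I would expect fastx_subsample to accept, so check them against your USEARCH version.

```shell
# Hypothetical helper: draw a random subset of OTU sequences for manual review.
#   $1 = input FASTA with OTU sequences
#   $2 = number of sequences to keep
#   $3 = output FASTA for the subset
subsample_otus() {
    usearch -fastx_subsample "$1" -sample_size "$2" -fastaout "$3"
}
```

For example, subsample_otus otus.fa 20 otus_subset.fa would write 20 randomly chosen OTUs to otus_subset.fa, which can then be aligned in place of the full otus.fa.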

Here is a typical command for making the alignments.

usearch -usearch_global otus.fa -db silva.udb -id 0.9 -strand both \
-alnout otu.aln -uc otu.uc

Review the alignments in the otu.aln file. Look for patterns which might not be biological. For example, if you see that many OTUs have mismatches near the beginning or end of the sequence, this may be due to primer-binding sequences, which should be stripped because PCR tends to introduce substitutions there.

To get a quick overview of the identity distribution, sort and count the values in field 4 of the .uc file (%id, see uc file format).

cut -f4 otu.uc | sort -g | uniq -c > id_dist.txt
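If the list of distinct %id values is too long to scan, a coarser view may help. The following is a minimal sketch, assuming the tab-separated uc layout with the record type (H for hit, N for no hit) in field 1 and %id in field 4. The function name uc_id_hist is hypothetical. It groups hits into 1% bins and counts OTUs without a hit separately.

```shell
# Hypothetical helper: histogram of %id in 1% bins from a .uc file.
#   $1 = .uc file written by usearch_global
uc_id_hist() {
    awk -F'\t' '
        $1 == "H" { bin[int($4)]++ }   # hit: truncate %id to a whole percent
        $1 == "N" { nohit++ }          # query with no hit above the -id threshold
        END {
            # Print bins from highest to lowest identity, skipping empty ones.
            for (b = 100; b >= 0; b--)
                if (b in bin) printf "%d\t%d\n", b, bin[b]
            printf "nohit\t%d\n", nohit + 0
        }' "$1"
}
```

Running uc_id_hist otu.uc prints one line per occupied bin (bin, then count), followed by the number of OTUs with no hit.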

Do you have many OTUs with 100% id to your reference database? If not, you should figure out why. Review the alignments for some of the OTUs with the highest identity. You may notice that most of these OTUs have mismatches at the same positions, which suggests that there are non-biological sequences in the reads which need to be stripped, e.g. a machine-specific sequence like the 454 TCAG calibration sequence.
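To pick which alignments to look up in otu.aln, it can help to list the OTUs that come close to 100% without reaching it. This is a sketch only, assuming the uc layout with the query label in field 9 and the target label in field 10. The function name uc_near_hits is hypothetical.

```shell
# Hypothetical helper: list hits below 100% id, highest identity first.
#   $1 = .uc file written by usearch_global
#   $2 = maximum number of OTUs to list (default 10)
uc_near_hits() {
    awk -F'\t' '$1 == "H" && $4 < 100 { print $4 "\t" $9 "\t" $10 }' "$1" |
        sort -k1,1gr |        # numeric sort on %id, descending
        head -n "${2:-10}"
}
```

For example, uc_near_hits otu.uc 5 prints %id, OTU label and reference label for the five best imperfect hits.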

The strand is the 5th field in the uc file, so to check that all OTUs are on the same strand:

cut -f5 otu.uc | sort | uniq -c

If you find that you have OTUs on both strands, then you need to orient the reads onto the same strand before quality filtering and re-build the OTUs.
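The strand check can also be scripted so that it ignores no-hit records and reports mixed strands through its exit status. This is a minimal sketch under the same assumptions about the uc layout (record type in field 1, strand in field 5). The function name uc_strand_check is hypothetical.

```shell
# Hypothetical helper: count hits per strand; exit status 1 if both strands occur.
#   $1 = .uc file written by usearch_global
uc_strand_check() {
    awk -F'\t' '
        $1 == "H" { n[$5]++ }          # count hits only, keyed by strand (+ or -)
        END {
            printf "plus\t%d\nminus\t%d\n", n["+"], n["-"]
            exit (n["+"] > 0 && n["-"] > 0)   # 1 = mixed strands, 0 = one strand only
        }' "$1"
}
```

A call such as uc_strand_check otu.uc || echo "OTUs on both strands" prints the two counts and adds the warning only when hits are found on both strands.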