OTU QC: Align to a reference database
for OTU sequences
I strongly recommend aligning your OTUs to a large reference database and
reviewing some selected alignments. For most samples, use the largest available
reference database for your gene, e.g. SILVA for 16S or UNITE for ITS. If
you have control samples (mock community or single strain), then you should
align the control OTU sequences to the known sequences in the strain(s).
You can select a random subset to review by using the
fastx_subsample command. I recommend
reviewing the top few most abundant OTUs, and also a few of the least
abundant. These alignments are especially likely to reveal problems.
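One way to pick out the most and least abundant OTUs is to rank them by total count. The sketch below assumes you have a tab-separated OTU table with the OTU label in the first column and one count column per sample; the file name demo_otutab.txt, its contents, and the rank_otus helper are all made up for illustration.

```shell
# Hypothetical OTU table: column 1 is the OTU label, the rest are per-sample counts.
printf '#OTU ID\tS1\tS2\n'  > demo_otutab.txt
printf 'Otu1\t500\t300\n'  >> demo_otutab.txt
printf 'Otu2\t2\t1\n'      >> demo_otutab.txt
printf 'Otu3\t40\t60\n'    >> demo_otutab.txt
printf 'Otu4\t0\t5\n'      >> demo_otutab.txt

# Print "label<TAB>total" for every OTU, sorted by descending total count.
# The header line (NR == 1) is skipped.
rank_otus() {
    awk -F'\t' 'NR > 1 { t = 0; for (i = 2; i <= NF; i++) t += $i; print $1 "\t" t }' "$1" |
        sort -t "$(printf '\t')" -k2,2nr
}

rank_otus demo_otutab.txt | head -n 2 | cut -f1 > top.txt     # most abundant labels
rank_otus demo_otutab.txt | tail -n 2 | cut -f1 > bottom.txt  # least abundant labels
```

The two label lists can then be used to find the corresponding entries in the alignment output, or passed to whatever sequence-extraction step you normally use.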
Here is a typical command for making the alignments:
usearch -usearch_global otus.fa -db silva.udb -id 0.9 -strand both \
-alnout otu.aln -uc otu.uc
Review the alignments in the otu.aln file. Look for patterns which
might not be biological. For example, if you see that many OTUs have
mismatches near the beginning or end of the sequence, this may be due to
primer-binding sequences, which should
be stripped because PCR tends to introduce substitutions at those positions.
To get a quick overview of the identity distribution, sort and count the values in field 4 of the
.uc file (%id; see the uc file format):
cut -f4 otu.uc | sort -g | uniq -c > id_dist.txt
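The per-value listing can get long when there are many OTUs. As a rough sketch, the same field can be grouped into a few identity ranges with awk. The file demo.uc and its four records are invented for illustration, the id_bins helper is hypothetical, and the bin boundaries (100, 97, 90) are arbitrary choices rather than anything USEARCH prescribes.

```shell
# Hypothetical tab-separated .uc records, for illustration only.
# Fields: type, cluster, length, %id, strand, unused, unused, alignment, query, target
printf 'H\t0\t250\t100.0\t+\t0\t0\t250M\tOtu1\tRefA\n'  > demo.uc
printf 'H\t1\t250\t98.4\t+\t0\t0\t250M\tOtu2\tRefB\n'  >> demo.uc
printf 'H\t2\t250\t93.2\t-\t0\t0\t250M\tOtu3\tRefC\n'  >> demo.uc
printf 'N\t*\t*\t*\t.\t*\t*\t*\tOtu4\t*\n'             >> demo.uc

# Bin field 4 (%id) of hit records (type H); count no-hit records (type N) separately.
id_bins() {
    awk -F'\t' '
        $1 == "H" {
            if ($4 >= 100)      b["100"]++
            else if ($4 >= 97)  b["97-99.9"]++
            else if ($4 >= 90)  b["90-96.9"]++
            else                b["<90"]++
        }
        $1 == "N" { b["no_hit"]++ }
        END {
            # Print bins in a fixed order; empty bins print as 0.
            n = split("100 97-99.9 90-96.9 <90 no_hit", order, " ")
            for (i = 1; i <= n; i++) printf "%s\t%d\n", order[i], b[order[i]]
        }
    ' "$1"
}

id_bins demo.uc
```

Run on a real file as `id_bins otu.uc`. A large no_hit or low-identity bin is a prompt to look at those alignments, not a verdict in itself.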
Do you have many OTUs with 100% id to your reference database? If not, you
should figure out why. Review the alignments for some of the OTUs with the highest identity. You
may notice that most of these OTUs have mismatches at the same positions, which
suggests that there are non-biological sequences in the reads which need to
be stripped, e.g. a machine-specific
sequence like the 454 TCAG calibration sequence.
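A quick complement to reading the alignments is to count how often each leading k-mer occurs in the OTU sequences; a short prefix shared by nearly every sequence may point to a leftover non-biological sequence. This is only a sketch: demo_otus.fa and its sequences are invented, prefix_counts is a hypothetical helper, and k = 4 is chosen just to match the length of the TCAG example.

```shell
# Hypothetical FASTA file for illustration; Otu1 is wrapped over two lines.
printf '>Otu1\nTCAGACGT\nACGT\n>Otu2\nTCAGGGTT\n>Otu3\nACGTACGT\n' > demo_otus.fa

# Count the first k bases of each sequence, using the first line after each header.
# Usage: prefix_counts FILE K
prefix_counts() {
    awk -v k="$2" '
        /^>/ { want = 1; next }                              # header: next line starts a sequence
        want { c[toupper(substr($0, 1, k))]++; want = 0 }    # tally the prefix once per sequence
        END  { for (p in c) print c[p] "\t" p }
    ' "$1" | sort -k1,1nr -k2,2
}

prefix_counts demo_otus.fa 4
```

A dominant prefix is only a hint; conserved regions of the gene can also produce shared prefixes, so confirm against the alignments before stripping anything.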
The strand is the 5th field in the uc file, so to check that all OTUs are on
the same strand:
cut -f5 otu.uc | sort | uniq -c
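If you want the strand check to be scriptable, for example as a step in a pipeline, something like the following sketch could work. It counts only hit records (type H), and the file demo_strand.uc and the strand_check helper are hypothetical names used for illustration.

```shell
# Hypothetical tab-separated .uc records: two hits on the plus strand, one on minus.
printf 'H\t0\t250\t99.2\t+\t0\t0\t250M\tOtu1\tRefA\n'  > demo_strand.uc
printf 'H\t1\t250\t98.0\t+\t0\t0\t250M\tOtu2\tRefB\n' >> demo_strand.uc
printf 'H\t2\t250\t97.6\t-\t0\t0\t250M\tOtu3\tRefC\n' >> demo_strand.uc

# Report hit counts per strand (field 5); return non-zero if both strands occur.
strand_check() {
    awk -F'\t' '
        $1 == "H" { n[$5]++ }
        END {
            printf "plus=%d minus=%d\n", n["+"], n["-"]
            if (n["+"] > 0 && n["-"] > 0) exit 1
        }
    ' "$1"
}

if strand_check demo_strand.uc; then
    echo "all hits are on one strand"
else
    echo "hits on both strands: orient the reads and rebuild the OTUs"
fi
```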
If you find that you have OTUs on both strands, then you need to
orient the reads onto the same strand
before quality filtering and re-build the OTUs.