OTU QC: Align to a reference database

See also
  Quality control for OTU sequences

I strongly recommend aligning your OTUs to a large reference database and reviewing a selection of the alignments. For most samples, use the largest available reference database for your gene, e.g. SILVA for 16S or UNITE for ITS. If you have control samples (mock community or single strain), then you should align the control OTU sequences to the known sequences of the strain(s).

Check highest and lowest abundance, shortest and longest sequence
You can select a random subset to review by using the fastx_subsample command. I recommend reviewing the top few most abundant OTUs, and also a few of the least abundant. These alignments are especially likely to reveal problems. Also check the shortest and longest sequences.
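To find the shortest and longest OTU sequences, something like the following sketch may help. The function name fasta_lengths is my own and is not part of usearch; it assumes a plain FASTA file and prints one line per sequence (length, then label), shortest first.

```shell
# Hypothetical helper (not a usearch command): print "length<TAB>label"
# for every sequence in a FASTA file, sorted from shortest to longest.
# Sequences wrapped over several lines are handled by summing line lengths.
fasta_lengths() {
  awk '
    /^>/ {
      if (label != "") print len "\t" label   # flush the previous record
      label = substr($0, 2)                    # drop the leading ">"
      len = 0
      next
    }
    { len += length($0) }                      # accumulate sequence length
    END { if (label != "") print len "\t" label }
  ' "$1" | sort -n
}
```

For example, fasta_lengths otus.fa | head -n 3 shows the three shortest OTUs and fasta_lengths otus.fa | tail -n 3 the three longest.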
Here is a typical command for making the alignments.

usearch -usearch_global otus.fa -db silva.udb -id 0.9 -strand both \
  -alnout otu.aln -uc otu.uc

Review the alignments in the otu.aln file. Look for patterns which might not be biological. For example, if many OTUs have mismatches near the beginning or end of the sequence, this may be due to primer-binding sequences, which should be stripped because PCR tends to introduce substitutions in them.

To get a quick overview of the identity distribution, sort and count the values in field 4 of the .uc file (%id, see uc file format).

cut -f4 otu.uc | sort -g | uniq -c > id_dist.txt
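If you only want the headline numbers, a small awk summary along these lines may be enough. The function name uc_id_summary is hypothetical, and the sketch assumes the record types described in the uc file format: H in field 1 for a hit and N for a query with no hit above the identity threshold.

```shell
# Hypothetical helper (not a usearch command): summarize a .uc file.
# Assumes tab-separated fields with record type in field 1 (H = hit,
# N = no hit) and percent identity in field 4.
uc_id_summary() {
  awk -F'\t' '
    $1 == "H" { hits++; if ($4 + 0 >= 100) perfect++ }  # 100% identity hits
    $1 == "N" { nohit++ }                               # OTUs with no hit
    END { printf "hits=%d perfect=%d nohit=%d\n", hits, perfect, nohit }
  ' "$1"
}
```

Running uc_id_summary otu.uc prints a single line with the number of OTUs that hit the database, how many of those hits are at 100% identity, and how many OTUs had no hit.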

Do you have many OTUs with 100% id to your reference database? If not, you should figure out why. Review the alignments for some of the OTUs with the highest identity. You may notice that most of these OTUs have mismatches at the same positions, which suggests that there are non-biological sequences in the reads which need to be stripped, e.g. a machine-specific sequence like the 454 TCAG calibration sequence.

The strand is the 5th field in the uc file, so to check that all OTUs are on the same strand:

cut -f5 otu.uc | sort | uniq -c

If you find that you have OTUs on both strands, then you need to orient the reads onto the same strand before quality filtering and re-build the OTUs.
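To see which OTUs are on the minus strand, the query labels can be pulled from the same file. This is a sketch only: the function name uc_minus_strand is my own, and it assumes that hit records have H in field 1 and the query label in field 9, as described in the uc file format.

```shell
# Hypothetical helper (not a usearch command): list the labels of OTUs
# whose best hit is on the minus strand.
# Assumes tab-separated fields: 1 = record type, 5 = strand, 9 = query label.
uc_minus_strand() {
  awk -F'\t' '$1 == "H" && $5 == "-" { print $9 }' "$1"
}
```

For example, uc_minus_strand otu.uc | head lists the first few minus-strand OTUs; their alignments in otu.aln are a good place to start reviewing.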