OTU QC: All OTUs should appear in the OTU table

See also
  Quality control for OTU sequences

All OTU sequences should be found in your OTU table. Here is how to analyze the problem if some are missing.

The three main explanations are:

E1. Some reads match more than one OTU at 97%, in which case only one of these OTUs may appear in the OTU table.

E2. Some OTUs have matches in the reads, but these matches are not found due to false negatives in the search performed by otutab.

E3. There is a mistake in your pipeline design causing some reads to be included when making the OTUs but excluded when making the OTU table.

Identify the missing OTU sequences
Find the labels of the missing OTUs and make a FASTA file missing.fa with the sequences. You can do this by cutting the labels out of the OTU table and grepping the labels from the FASTA file with the OTU sequences, then using the Linux uniq command to find labels which appear only in the FASTA file. Finally, use the fastx_getseqs command to extract those sequences. For example,

cut -f1 otutab.txt | grep -v "^#" > table_labels.txt
grep "^>" otus.fa | sed "s/^>//" > seq_labels.txt
sort seq_labels.txt table_labels.txt table_labels.txt | uniq -u > missing_labels.txt
usearch -fastx_getseqs otus.fa -labels missing_labels.txt -fastaout missing.fa
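As a quick sanity check on the extraction step, the number of records in missing.fa should equal the number of labels in missing_labels.txt. The following is a minimal sketch; count_check is a hypothetical helper name, not part of usearch, and it assumes one label per line.

```shell
# count_check LABELS FASTA (hypothetical helper, not a usearch command):
# compare the number of non-empty label lines with the number of FASTA records.
count_check() {
  n_labels=$(grep -c . "$1")      # non-empty lines in the label file
  n_seqs=$(grep -c '^>' "$2")     # FASTA header lines
  echo "labels=$n_labels sequences=$n_seqs"
  # Exit status is 0 only when the two counts agree.
  [ "$n_labels" -eq "$n_seqs" ]
}
```

For example, count_check missing_labels.txt missing.fa. A mismatch usually means the labels in the OTU table and in otus.fa are not written identically, so it is worth comparing a few of them by eye.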

Strand duplicates and offsets
Common causes of (E1) are strand duplicates and offsets. You can check for these problems by aligning the missing OTUs to the original OTUs with the missing OTUs removed. If there are matches, then this is most likely due to either strand duplicates or offsets. This will be clear from the alignments: a strand duplicate will align on the negative strand and an offset will align with terminal gaps, which can be seen by noting that the start of the alignment in one of the sequences is not at position 1. To make the database with the not-missing OTUs, you can use the Linux uniq command again, as follows.

sort missing_labels.txt missing_labels.txt seq_labels.txt | uniq -u > notmissing_labels.txt
usearch -fastx_getseqs otus.fa -labels notmissing_labels.txt -fastaout notmissing.fa

Then align the missing OTUs to the not-missing OTUs and examine the alignments:

usearch -usearch_global missing.fa -db notmissing.fa -strand both -id 0.97 \
  -uc missnot.uc -alnout missnot.aln
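If there are many hits, it can help to summarize missnot.uc before reading the alignments. The sketch below assumes the usual tab-separated .uc layout (record type in column 1, percent identity in column 4, strand in column 5, compressed alignment in column 8, query and target labels in columns 9 and 10) and assumes that terminal gaps show up as a leading or trailing D or I in the compressed alignment. The helper name classify_hits is hypothetical, and the "possible_offset" call is only a hint to be confirmed in missnot.aln.

```shell
# classify_hits UC_FILE (hypothetical helper): for each hit record (type H),
# print query, target, identity, strand and a rough classification.
classify_hits() {
  awk -F'\t' '
    $1 == "H" {
      kind = "other"
      if ($5 == "-")
        kind = "strand_duplicate"      # hit found on the negative strand
      else if ($8 ~ /^[0-9]*[DI]/ || $8 ~ /[DI]$/)
        kind = "possible_offset"       # alignment starts or ends with a gap run
      print $9 "\t" $10 "\t" $4 "\t" $5 "\t" kind
    }' "$1"
}
```

For example, classify_hits missnot.uc | sort -k5,5 groups the hits by classification.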

False negatives by otutab
If only a few OTUs are missing, say one or two, then this may be due to (E2), i.e. a problem with the otutab command. It uses heuristics to optimize speed, and in rare cases it may fail to find a match between a query sequence and an OTU sequence. You can check this by generating an OTU table using missing.fa instead of otus.fa. If you now get non-zero counts for the missing OTUs, then one possible explanation is false negatives. It may also be due to offsets or strand duplicates as discussed above. If the problem is due to search heuristics, then you should be able to fix it by setting the -maxrejects option of otutab to a higher value, say -maxrejects 1000.
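The check described above might be sketched as follows. The read file name reads.fq and the output name missing_otutab.txt are placeholders, and both helper names are hypothetical. The first function only wraps the otutab call and is not run when it is defined; the second sums the counts in each row of a tab-separated OTU table so that non-zero totals are easy to spot.

```shell
# rerun_otutab (hypothetical wrapper): build an OTU table against the
# missing OTUs only. reads.fq stands for whatever reads you gave to otutab.
rerun_otutab() {
  usearch -otutab reads.fq -otus missing.fa -otutabout missing_otutab.txt
}

# otu_totals TABLE (hypothetical helper): print each OTU label with its
# total count across all samples, skipping the header line starting with #.
otu_totals() {
  awk -F'\t' '!/^#/ {
    total = 0
    for (i = 2; i <= NF; i++) total += $i
    print $1 "\t" total
  }' "$1"
}
```

After rerun_otutab, running otu_totals missing_otutab.txt lists the missing OTUs that now receive reads. If the totals are non-zero and the alignment check found no strand duplicates or offsets, rebuilding the full table from otus.fa with -maxrejects 1000 is the next thing to try.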

Tight OTUs
Another cause of (E1) is tight OTUs, i.e. OTUs that have >97% identity with each other. This applies only to cluster_otus, not to denoising (unoise3). See tight OTUs for how to identify this problem.

Check your pipeline
You can't use the same FASTA file as input to OTU clustering (cluster_otus or unoise3) and to otutab. The otutab command requires sample labels in the sequence labels, so a given unique sequence usually appears many times, while the input to OTU clustering requires exactly one copy of each unique sequence with a size annotation. This complicates the analysis pipeline and makes it prone to mistakes, so if none of the above checks identified the problem, you should make sure that the set of sequences used as input to otutab includes all the sequences used to make the OTU sequences. In fact, the input to otutab should generally contain more sequences, because I usually recommend using unfiltered reads, including singletons, to make the OTU table, while the OTU sequences should be generated from filtered reads with singletons discarded.
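One rough way to test this is to look for unique sequences that have no exact copy among the reads given to otutab. The sketch below assumes both inputs are FASTA (convert FASTQ first), that sequences are on the same strand, and that filtering did not trim or truncate the reads, since trimmed sequences would no longer match exactly. The helper name seqs_not_in_reads and the file names in the usage note are hypothetical.

```shell
# seqs_not_in_reads UNIQUES_FA READS_FA (hypothetical helper): print the
# labels of sequences in UNIQUES_FA with no identical sequence in READS_FA.
seqs_not_in_reads() {
  awk '
    # Finish the current record: remember it (reads file) or test it (uniques file).
    function emit() {
      if (label == "") return
      if (in_reads) seen[seq] = 1
      else if (!(seq in seen)) print label
    }
    # On the first line of each file, close the previous record, then switch mode.
    FNR == 1 { emit(); label = ""; seq = ""; in_reads = (FILENAME == ARGV[1]) }
    /^>/     { emit(); label = substr($0, 2); seq = ""; next }
             { seq = seq toupper($0) }   # join wrapped sequence lines
    END      { emit() }
  ' "$2" "$1"   # the reads file is processed first, then the uniques
}
```

For example, seqs_not_in_reads uniques.fa reads.fa. Any labels printed point to sequences that went into OTU clustering but could not have been counted by otutab, which is the situation described in (E3).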