Mapping reads to OTUs

See also
  Defining and interpreting OTUs
  Interpreting counts and frequencies in OTU tables 

The otutab command maps a read to an OTU by finding the OTU sequence with highest identity above a given threshold (usually 97%). If there is a tie, the tie is broken by choosing the first OTU in database file order.
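The selection rule in the paragraph above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the otutab implementation: the function name assign_read, the list of (name, identity) hits and the choice of an inclusive threshold (>=97%) are assumptions made for the example.

```python
def assign_read(hits, threshold=0.97):
    """Pick the OTU for one read.

    hits: list of (otu_name, identity) pairs in database file order
          (hypothetical input format; identities are fractions, e.g. 0.98).
    Returns the name of the OTU with the highest identity at or above the
    threshold, or None if no OTU qualifies.
    """
    best_name = None
    best_identity = None
    for name, identity in hits:
        if identity < threshold:
            continue  # below the threshold: not a candidate
        # Strict '>' keeps the earlier OTU on a tie, so ties are broken
        # by database file order.
        if best_identity is None or identity > best_identity:
            best_name, best_identity = name, identity
    return best_name


# Otu2 and Otu3 tie at 98.5%; Otu2 comes first in the database, so it wins.
example_hits = [("Otu1", 0.975), ("Otu2", 0.985), ("Otu3", 0.985)]
chosen = assign_read(example_hits)  # "Otu2"
```

A read with no hit at or above the threshold returns None, i.e. it is left unassigned in this sketch.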

A read sequence can match two or more OTUs with >=97% identity. It has been suggested (Ye et al. 2015) that the read should be assigned to the OTU with highest abundance rather than highest identity. I disagree, because high identity is a better signal that the sequence is from the same species.

Suppose a sequence (S) matches two OTU sequences: A with 97% identity and B with 98% identity. The unique sequence A has abundance 1,000 and B has abundance 100. Should we assign S to A or B? A has higher abundance but lower identity; B has lower abundance but higher identity.
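To make the two candidate rules concrete, here is a small Python sketch applied to the example above. The dictionary layout and the function names pick_by_identity and pick_by_abundance are hypothetical; only the identities and abundances come from the example.

```python
# The two OTUs that S matches, in (assumed) database order.
hits = [
    {"otu": "A", "identity": 0.97, "abundance": 1000},
    {"otu": "B", "identity": 0.98, "abundance": 100},
]


def pick_by_identity(hits):
    # Rule argued for on this page: highest identity wins.
    # max() returns the first maximal item, so ties go to the earlier OTU.
    return max(hits, key=lambda h: h["identity"])["otu"]


def pick_by_abundance(hits):
    # Alternative rule suggested by Ye et al. (2015): highest abundance wins.
    return max(hits, key=lambda h: h["abundance"])["otu"]


by_identity = pick_by_identity(hits)    # "B"
by_abundance = pick_by_abundance(hits)  # "A"
```

The two rules disagree on this input, which is exactly the case the rest of the page works through.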

There are three possible reasons why S does not exactly match an OTU sequence: (1) it is a correct biological sequence, (2) it has sequence errors due to PCR or sequencing, or (3) it is chimeric.

(1) S is a correct biological sequence.
Here, either S is a paralog of A or B derived from the same genome, or S is from a different species.

(1a) If S is a paralog, we would prefer to assign it to the same OTU as the other paralog(s) from the species. This is more likely to be B, because paralogs in a given species almost always have higher identity to each other than to genes in another species. (The same argument applies to intra-species variation.)

(1b) If S belongs to a different species, then we are lumping two species into the same OTU and there is no reason to prefer A or B.

Conclusion: if S is a correct biological sequence, it is better to choose the OTU with highest identity, because that OTU is the most likely to belong to the same species. We should therefore assign S to B.

(2) S has sequencing errors.
Here, either S is a bad read of A or B, or S is a bad read of a correct biological sequence which is within the identity threshold of an existing OTU and therefore does not have its own OTU.

(2a) If we know that S is a bad read of an OTU sequence then we should again choose the highest identity match because this is much more likely to be the correct sequence. Suppose S is a bad read of A. Adding errors to A will probably reduce identity to both A and B. A bad read of A which has higher identity to B must have base call errors that reproduce letters in B by chance; this is very unlikely.
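How unlikely this is can be estimated with a rough calculation. The sketch below rests on simplifying assumptions that are not part of the argument above: errors are independent substitutions with the same per-base rate, each of the three wrong letters is equally likely, and indels are ignored. The function name prob_closer_to_b and the example values (8 differences between A and B, 1% error rate) are hypothetical. Positions where A and B agree cannot favor either OTU, so only the d positions where they differ matter: at each one the read keeps A's letter, switches to B's letter, or switches to a third letter. The read has higher identity to B only if more of these positions carry B's letter than A's letter.

```python
from math import comb


def prob_closer_to_b(d, error_rate):
    """Probability that a noisy read of A has strictly higher identity to B.

    d: number of alignment positions where A and B differ.
    error_rate: per-base substitution error rate (assumed uniform).
    """
    p_b = error_rate / 3          # error reproduces B's letter by chance
    p_other = 2 * error_rate / 3  # error gives a letter matching neither
    p_a = 1 - error_rate          # no error: A's letter is kept
    total = 0.0
    for k in range(d + 1):            # k positions now show B's letter
        for m in range(d - k + 1):    # m positions match neither A nor B
            a = d - k - m             # a positions still show A's letter
            if k > a:                 # more matches to B than to A
                ways = comb(d, k) * comb(d - k, m)
                total += ways * p_b**k * p_other**m * p_a**a
    return total


# Hypothetical example: A and B differ at 8 positions, 1% error rate.
p = prob_closer_to_b(8, 0.01)
```

Under these assumptions the probability is far below one in a million, because at least five of the eight differing positions would have to be hit by errors, most of them reproducing B's letter.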

(2b) If S is a bad read of a biological sequence which is not A or B then this case is similar to (1) and we should therefore prefer the highest identity match.

Conclusion: if S has sequencing errors, we should assign it to the OTU with highest identity, because this is much more likely to be the correct sequence. Again, we should assign S to B.

(3) S is an undetected chimera.
This scenario is less common than (1) or (2) because chimeras are rare as a fraction of the reads. If S is chimeric, there is no reason to prefer A or B.