Mapping reads to OTUs
See also: Interpreting counts and frequencies in OTU tables
The otutab command maps a read to an OTU by
finding the OTU sequence with the highest identity above a given threshold
(usually 97%). If two or more OTUs tie for the highest identity, the tie
is broken by choosing the first OTU in database file order.
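
The assignment rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the code used by otutab: the function name `assign_read`, the list-of-pairs input, and the use of `>=` for the threshold comparison are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def assign_read(
    hits: Sequence[Tuple[str, float]], threshold: float = 0.97
) -> Optional[str]:
    """Pick the OTU with the highest identity at or above the threshold.

    `hits` holds (otu_label, identity) pairs listed in database file order.
    Returns None when no OTU reaches the threshold.
    """
    best_label = None
    best_identity = 0.0
    for label, identity in hits:
        if identity < threshold:
            continue  # below the threshold, not a candidate
        # Strict ">" keeps the earlier OTU when identities tie,
        # which mimics "first OTU in database file order".
        if best_label is None or identity > best_identity:
            best_label, best_identity = label, identity
    return best_label
```

With this sketch, two OTUs that tie at 98% resolve to whichever appears first in the list, and a read whose best hit is below the threshold is left unassigned.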
A read sequence can match two or more OTUs with >=97% identity. It has been
suggested (Ye et al. 2015) that the read should be assigned to the OTU with
the highest abundance rather than the highest identity. I disagree, because
high identity is a better signal that the sequence is from the same species.
Suppose a sequence (S)
matches two OTU sequences: A with 97% identity and B with 98% identity. The
unique sequence A has abundance 1,000 and B has abundance 100. Should we
assign S to A or B? A has higher abundance but lower identity, and vice versa
for B.
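
To make the two competing rules concrete, here is a minimal sketch that applies both to the example above. The `Hit` class and the function names `by_identity` and `by_abundance` are hypothetical and exist only for this illustration; the numbers are the ones given in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    otu: str
    identity: float  # fractional identity between S and the OTU sequence
    abundance: int   # abundance of the OTU's unique sequence


def by_identity(hits: List[Hit], threshold: float = 0.97) -> Optional[str]:
    """Choose the OTU with the highest identity among hits passing the threshold."""
    candidates = [h for h in hits if h.identity >= threshold]
    # max() returns the first maximal element, so ties go to the earlier hit.
    return max(candidates, key=lambda h: h.identity).otu if candidates else None


def by_abundance(hits: List[Hit], threshold: float = 0.97) -> Optional[str]:
    """Choose the OTU with the highest abundance among hits passing the threshold."""
    candidates = [h for h in hits if h.identity >= threshold]
    return max(candidates, key=lambda h: h.abundance).otu if candidates else None


# The example from the text: S matches A at 97% and B at 98%.
hits = [Hit("A", 0.97, 1000), Hit("B", 0.98, 100)]
```

Here `by_identity(hits)` selects B and `by_abundance(hits)` selects A, which is exactly the disagreement the following cases try to resolve.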
There are three possible reasons why S does not exactly match an OTU
sequence: 1. it is a correct biological sequence, 2. it has sequence errors
due to PCR or sequencing, or 3. it is chimeric.
(1) S is a correct biological sequence
Here, either S is a paralog of A
or B derived from the same genome, or S is from a different species.
(1a) If S is a paralog, we would prefer to assign it to the same OTU as the
other paralog(s) from the species. This is more likely to be B because
paralogs tend to have high identity. Paralogs in a given species almost
always have higher identity to each other than to genes in another species.
(The same argument applies to intra-species variation.)
(1b) If S belongs to a different species, then we are lumping two species
into the same OTU and there is no reason to prefer A or B.
Conclusion: if S is a correct biological sequence, it is better to choose
the OTU with the highest identity, because it is the most likely to belong
to the same species. We should therefore assign S to B.
(2) S has sequencing errors.
Here, either S is a bad read of A or B, or S
is a bad read of a correct biological sequence whose identity to an existing
OTU is above the threshold, so it does not have its own OTU.
(2a) If we know that S is a bad read of an OTU sequence then we should again
choose the highest identity match because this is much more likely to be the
correct sequence. Suppose S is a bad read of A. Adding errors to A will
probably reduce identity to both A and B. A bad read of A which has higher
identity to B must have base call errors that reproduce letters in B by
chance; this is very unlikely.
(2b) If S is a bad read of a biological sequence which is not A or B then
this case is similar to (1) and we should therefore prefer the highest
identity.
Conclusion: if S has sequencing errors, we should assign it to the OTU with
the highest identity because this is much more likely to be the correct
sequence, so we should assign S to B.
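
The claim in (2a), that a bad read of A is very unlikely to end up closer to B by chance, can be checked with a small simulation. This is a rough sketch under simplifying assumptions that are not in the text: sequences are random, 250 bases long, A and B differ by 8 substitutions (about 97% identity), each bad read carries 5 substitution errors at uniformly random positions, and identity is measured position by position without alignment. All function names and parameter values are hypothetical.

```python
import random

BASES = "ACGT"


def mutate(seq: str, n_subs: int, rng: random.Random) -> str:
    """Return a copy of seq with n_subs substitutions at distinct random positions."""
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_subs):
        # Always replace with a different base so every substitution is real.
        out[pos] = rng.choice([b for b in BASES if b != out[pos]])
    return "".join(out)


def identity(x: str, y: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def fraction_closer_to_b(
    length: int = 250,
    otu_diffs: int = 8,
    read_errors: int = 5,
    trials: int = 2000,
    seed: int = 1,
) -> float:
    """Fraction of simulated bad reads of A that have higher identity to B than to A."""
    rng = random.Random(seed)
    a = "".join(rng.choice(BASES) for _ in range(length))
    b = mutate(a, otu_diffs, rng)  # B is a distinct OTU, about 97% identical to A
    closer = 0
    for _ in range(trials):
        s = mutate(a, read_errors, rng)  # a bad read of A
        if identity(s, b) > identity(s, a):
            closer += 1
    return closer / trials
```

Under these assumptions the returned fraction is expected to be very close to zero, because the errors would have to land on the few positions where A and B differ and also reproduce the base found in B. Real reads have indels and position-dependent error rates, so this only illustrates the direction of the argument.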
(3) S is an undetected chimera.
This scenario is less common than (1) or
(2) because chimeras are rare as a fraction of the reads. If S is chimeric,
there is no reason to prefer A or B.