FAQ: How can I compare two sets of OTUs?
Quality control for OTUs
Suppose you want to compare two sets of OTUs, e.g. made by
two different pipelines (say, QIIME and UPARSE) or a single pipeline with two
different sets of parameters (say, with different filtering, length or abundance
thresholds). How can you do this?
Compare OTUs for control
If you have control samples such as a mock community or a
single strain, then you can assess the accuracy of the OTUs for the control samples.
Finding the common subset between two sets of OTUs
If you have two sets of OTUs X and Y, a simple question to ask is:
which OTUs are (a) in both X and Y, (b) in X only, and (c) in Y only? Those that
are in both X and Y are more likely to be correct. Those in X only or Y only
point to an error in one of the sets: each is either a false positive in the set
that contains it or a false negative in the other set. You can find these
subsets using the usearch_global command, like this:
usearch -usearch_global otusx.fa -db otusy.fa -id 0.97 -strand plus \
  -matched x_and_y.fa -notmatched x_only.fa
usearch -usearch_global otusy.fa -db otusx.fa -id 0.97 -strand plus \
  -notmatched y_only.fa
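As a quick summary of the three subsets, the number of sequences in each output file can be counted. The sketch below assumes the file names used in the commands above and that every FASTA record starts with a `>` line; `count_seqs` is a helper name introduced here for illustration.

```shell
# Count FASTA records by counting header lines (those starting with ">").
count_seqs() {
  # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
  grep -c '^>' "$1" || true
}

# Report the size of each subset, skipping files that have not been created yet.
for f in x_and_y.fa x_only.fa y_only.fa; do
  if [ -f "$f" ]; then
    printf '%s\t%s\n' "$f" "$(count_seqs "$f")"
  fi
done
```

A large x_only or y_only count relative to x_and_y suggests the two sets disagree substantially and deserve a closer look.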
If the OTU sequences have different lengths, e.g. because you used
different global trimming lengths, then
the results are harder to interpret because two longer sequences can be
<97% identical to each other yet share an identical prefix.
You may therefore have two long OTUs (A, B) that both match one short OTU (C)
exactly, and all three sequences could be correct.
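Before interpreting the matches, it may help to confirm whether the two sets really were trimmed to different lengths. This is a minimal sketch, assuming plain FASTA input (sequences may span several lines); `fasta_len_range` is a helper name introduced here, and it prints the shortest and longest sequence length found in a file.

```shell
# Print "min max" sequence length for a FASTA file.
fasta_len_range() {
  awk '
    # Fold the length of the record just finished into the running min/max.
    function record() {
      if (n == 0 || len < min) min = len
      if (len > max) max = len
      n++
    }
    /^>/ { if (have) record(); have = 1; len = 0; next }
         { len += length($0) }
    END  { if (have) record(); print min + 0, max + 0 }
  ' "$1"
}

# Compare the two OTU sets, if the files are present.
for f in otusx.fa otusy.fa; do
  if [ -f "$f" ]; then
    printf '%s\t%s\n' "$f" "$(fasta_len_range "$f")"
  fi
done
```

If the ranges differ, matches between a long and a short OTU should be read with the prefix effect described above in mind.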
Checking for chimeras
Most chimeras are derived from the most abundant sequences in the
PCR. As a check on which set of OTUs has more chimeras, take the
OTUs with highest abundance and use them as a reference database for
uchime2_ref using -mode specific. If you
have the original reads, then it is even better to find the most abundant
unique sequences and use those as the reference, e.g. take the top 20:
usearch -fastx_uniques reads.fq -fastaout top_uniques.fa -topn 20 -sizeout
Then check both sets of OTUs:
usearch -uchime2_ref otusx.fa -db top_uniques.fa -strand plus \
  -uchimeout otusx.uchime -uchimealnout otusx.uchimealns -mode specific
usearch -uchime2_ref otusy.fa -db top_uniques.fa -strand plus \
  -uchimeout otusy.uchime -uchimealnout otusy.uchimealns -mode specific
Now you can review the output to see if there is a big
difference between the number of chimeras in each set.
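One way to put numbers on that comparison is to tally the classification column of each tabbed file written by -uchimeout. This is only a sketch: which column holds the chimera call is an assumption to verify against the usearch documentation for your version, so the column number is passed explicitly and defaults to the last column. `tally_column` is a helper name introduced here.

```shell
# Tally the distinct values in one tab-separated column of a file.
# $1 = file, $2 = column number (optional; defaults to the last column).
tally_column() {
  awk -F '\t' -v col="${2:-0}" '
    { key = (col > 0) ? $col : $NF; count[key]++ }
    END { for (k in count) printf "%s\t%d\n", k, count[k] }
  ' "$1" | sort
}
```

Running the function on the tabbed output for each OTU set gives one line per distinct value with its count, which can then be compared side by side.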
Searching a large database (SILVA)
Searching a large
database such as SILVA can give useful insights into the OTUs. Exact matches
(100% identity) are very likely to be correct sequences. If one set has more
exact matches, it is probably more sensitive, though keep in mind it may
also have more spurious OTUs due to chimeras and read errors. Hits with >97%
but <100% identity are more difficult to interpret: they could be correct, but
they could also contain errors. An OTU with <3% errors is relatively harmless
because sequences in a 97% OTU are allowed to vary by this much. An OTU with
>3% errors is harmful because the correct sequence is probably in a different
OTU, so this OTU is completely spurious and inflates the apparent diversity.
Unfortunately, even large databases such as SILVA contain only a small subset
of extant species, so it is difficult to interpret hits that are <100%.
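The identity ranges discussed above can be tabulated from a search report. The sketch below assumes the OTUs were searched with usearch_global using -userout and -userfields query+target+id, so that the third tab-separated column is the percent identity. The database file name silva.fa, the report file name, the -id threshold and the helper `bin_hits` are placeholders introduced here for illustration.

```shell
# Example search that would produce the report (shown as a comment, not run here):
#   usearch -usearch_global otusx.fa -db silva.fa -id 0.9 -strand both \
#     -userout otusx_silva.txt -userfields query+target+id
#
# Bin hits by percent identity (column 3): exact, 97% up to <100%, and below 97%.
bin_hits() {
  awk -F '\t' '
    $3 == 100 { exact++; next }
    $3 >= 97  { close_hit++; next }
              { distant++ }
    END { printf "exact\t%d\nge97\t%d\nlt97\t%d\n", exact, close_hit, distant }
  ' "$1"
}
```

OTUs with no hit above the search threshold do not appear in the report, so the three counts may sum to fewer than the number of OTUs. Comparing the bins for the two OTU sets shows which has more exact matches, subject to the caveat that the database covers only a small subset of species.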