OTU QC: All OTUs should appear in the OTU table
All OTU sequences should be found in your OTU table. Here is how to analyze
the problem if some are missing.
The three main explanations are:
E1. Some reads match more than one OTU at 97%, in which case only one of
these OTUs may appear in the OTU table.
E2. Some OTUs have matches in the reads, but these matches are not found due
to false negatives in the search performed by otutab.
E3. There is a mistake in your pipeline design causing some reads to be
included when making the OTUs but excluded when making the OTU table.
Identify the missing OTU sequences
Find the labels of
the missing OTUs and make a FASTA file missing.fa with the sequences. You
can do this by cutting the labels out of the OTU table and grepping the
labels from the FASTA file with the OTU sequences, then using the Linux uniq
command to find labels which appear only in the FASTA file. Finally, use the
fastx_getseqs command to extract those
sequences. For example,
cut -f1 otutab.txt | grep -v "^#" > table_labels.txt
grep "^>" otus.fa | sed "s/^>//" > seq_labels.txt
sort seq_labels.txt table_labels.txt table_labels.txt | uniq -u > missing_labels.txt
usearch -fastx_getseqs otus.fa -labels missing_labels.txt -fastaout missing.fa
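The reason table_labels.txt is listed twice in the sort command may not be obvious, so here is a minimal sketch of the trick on made-up labels. The temporary directory and its file contents are hypothetical and exist only for this illustration. Listing the table labels twice guarantees that every table label occurs at least twice in the sorted stream, so uniq -u discards it even if it is absent from the FASTA file.

```shell
# Toy demonstration of the sort | uniq -u set difference (all labels are made up).
tmp=$(mktemp -d)
printf 'Otu1\nOtu2\nOtu3\nOtu4\n' > "$tmp/seq_labels.txt"    # labels in the FASTA file
printf 'Otu1\nOtu3\nOtu9\n'       > "$tmp/table_labels.txt"  # labels in the OTU table

# Table labels are passed twice, so each occurs two or three times after
# sorting and uniq -u drops it. Labels found only in the FASTA file occur
# exactly once and are kept.
missing=$(sort "$tmp/seq_labels.txt" "$tmp/table_labels.txt" "$tmp/table_labels.txt" | uniq -u)
echo "$missing"
rm -r "$tmp"
```

With these toy files only Otu2 and Otu4 are reported. Otu9, which is in the table but not in the FASTA file, is not reported because it occurs twice.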
Strand duplicates and offsets
Common causes of (E1) are strand duplicates and
offsets. You can check for these problems
by aligning the missing OTUs to the original OTUs with the missing OTUs
removed. If there are matches, then this is most likely due to either strand
duplicates or offsets. This will be clear from the alignments: a strand
duplicate will align on the negative strand and an offset will align with
terminal gaps, which can be seen by noting that the start of the alignment
in one of the sequences is not at position 1. To make the database with the
not-missing OTUs, you can use the Linux uniq command again, as follows.
sort missing_labels.txt missing_labels.txt seq_labels.txt | uniq -u > notmissing_labels.txt
usearch -fastx_getseqs otus.fa -labels notmissing_labels.txt -fastaout notmissing.fa
Then align the missing OTUs to the not-missing OTUs
and examine the alignments:
usearch -usearch_global missing.fa -db notmissing.fa -strand both -id 0.97 \
  -uc missnot.uc -alnout
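As a quick first pass before reading the alignments, the hits in missnot.uc can be sorted by strand. The sketch below assumes the usual tab-separated .uc layout, in which field 1 is the record type (H for a hit, N for no hit), field 5 is the strand, field 8 is the compressed alignment, field 9 is the query label and field 10 is the target label. The records written here are made-up toy values, not real output.

```shell
# Toy .uc records (made-up values) standing in for missnot.uc.
tmp=$(mktemp -d)
printf 'H\t0\t250\t99.2\t-\t.\t.\t250M\tOtu7\tOtu2\n'    > "$tmp/missnot.uc"
printf 'H\t1\t250\t98.0\t+\t.\t.\t5I245M\tOtu9\tOtu4\n' >> "$tmp/missnot.uc"
printf 'N\t*\t*\t*\t.\t*\t*\t*\tOtu12\t*\n'             >> "$tmp/missnot.uc"

# Hits on the minus strand: candidate strand duplicates (query, target).
minus_hits=$(awk -F'\t' '$1 == "H" && $5 == "-" {print $9 "\t" $10}' "$tmp/missnot.uc")
# Hits on the plus strand: check these alignments for offsets (query, target, alignment).
plus_hits=$(awk -F'\t' '$1 == "H" && $5 == "+" {print $9 "\t" $10 "\t" $8}' "$tmp/missnot.uc")
# Missing OTUs with no hit: not explained by strand duplicates or offsets.
no_hits=$(awk -F'\t' '$1 == "N" {print $9}' "$tmp/missnot.uc")

echo "minus-strand hits: $minus_hits"
echo "plus-strand hits:  $plus_hits"
echo "no hit:            $no_hits"
rm -r "$tmp"
```

This only triages the hits. Whether a plus-strand hit is really an offset still has to be confirmed from the terminal gaps in the alignment itself.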
False negatives by otutab
If only a few OTUs are
missing, say one or two, then this may be due to (E2), i.e. a problem with
the otutab command. It uses heuristics to optimize speed, and in rare cases
it may fail to find a match between a query sequence and an OTU sequence.
You can check this by generating an OTU table using missing.fa instead of
otus.fa. If you now get non-zero counts for the missing OTUs, then one
possible explanation is false negatives. It may also be due to offsets or
strand duplicates as discussed above. If the problem is due to search
heuristics, then you should be able to fix the problem by setting the
-maxrejects option of otutab to a higher value, say -maxrejects 1000.
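To see which of the missing OTUs picked up reads in the regenerated table, the counts in each row can be summed. The sketch below assumes a tab-separated table whose header line starts with "#" and whose first column is the OTU label. The name missing_otutab.txt is hypothetical, as are the sample names and counts.

```shell
# Toy OTU table (made-up counts) standing in for the table built from missing.fa.
tmp=$(mktemp -d)
printf '#OTU ID\tS1\tS2\n'  > "$tmp/missing_otutab.txt"
printf 'Otu2\t0\t3\n'      >> "$tmp/missing_otutab.txt"
printf 'Otu4\t0\t0\n'      >> "$tmp/missing_otutab.txt"

# Skip the header, sum the per-sample counts in each row, and print the
# label and total for every OTU whose total is greater than zero.
recovered=$(awk -F'\t' '!/^#/ {
    total = 0
    for (i = 2; i <= NF; i++) total += $i
    if (total > 0) print $1 "\t" total
}' "$tmp/missing_otutab.txt")
echo "$recovered"
rm -r "$tmp"
```

Any OTU listed here has reads that match it when it is searched on its own, which is consistent with (E2) but also with offsets or strand duplicates.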
Tight OTUs
Another cause of (E1) is tight OTUs, i.e.
OTUs that have >97% identity with each other. This applies only to
cluster_otus, not to denoising (unoise3).
See tight OTUs for how to identify this problem.
Check your pipeline
You can't use the same FASTA file as
input to OTU clustering (cluster_otus or
unoise3) and to
otutab because otutab requires sample labels in the sequence labels, so
a given unique sequence usually appears many times, while the input to OTU
clustering requires exactly one copy of each unique sequence with a size
annotation. This adds a complication to the analysis pipeline which could be
prone to mistakes, so if none of the above checks identified the problem,
then you should make sure that the set of sequences used as input to otutab
has all the sequences used to make the OTU sequences. In fact, the input to
otutab should generally contain more sequences because I usually recommend
using unfiltered reads, including singletons, to make the OTU table, while
the OTU sequences should be generated from filtered reads with singletons