OTU QC: sequencing and PCR error
Almost certainly, some of your OTUs will be spurious because of sequence
errors introduced by amplification (PCR substitution errors and chimeras) and
by sequencing. Unfortunately, these are very difficult to identify, even in
control samples, which is why I strongly recommend expected error filtering
and discarding singletons.
The only reasonably reliable way to test for spurious OTUs due to sequence
errors is to use a control sample with known sequences, i.e. a single strain
or a mock community. If you have such a control sample, then you will
probably find that the reads for the controls have spurious OTUs due to
cross-talk, so you should try to distinguish
these from spurious OTUs due to sequence errors. You can do this by using
the uncross command and manually reviewing
the predictions, or by aligning OTUs to the reference sequences in your
control sample and examining those which match, but not exactly, say in the
range 95% to 99% identity. These OTUs are quite likely explained by sequence
errors, but unfortunately other explanations are possible. Typical mock
communities have strains which are common human pathogens, so if your "real"
samples are from humans, they may contain closely related species (this often happens
in data that I analyze in my own work, so this is not as paranoid as it
might sound). In that case, an OTU with some differences compared with the
reference sequence might be the correct sequence of a strain in the other
samples. This can be tested to some extent by making OTUs from the control
samples only, keeping in mind that this will not eliminate cross-talk. Also,
it is impossible to distinguish chimeras from sequence errors when the
number of differences is small. Bottom line, with some effort you can
probably get a good sense of which OTUs in your mock samples have sequence
errors, but you can't be sure.
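
The identity-based review described above can be sketched in a few lines of Python. This is only an illustration, not part of any tool: it assumes you have already aligned each OTU to the control reference sequences and recorded the identity of the best hit. The function name, the category labels and the example values are all hypothetical, and the 95% lower bound is the rough figure suggested above, not a fixed rule.

```python
from collections import Counter

def classify_otu(pct_id, low=95.0):
    """Bin an OTU by its percent identity to the closest control reference.

    The labels are hypothetical and only suggest what to review by hand.
    """
    if pct_id >= 100.0:
        return "exact"        # matches a known control sequence
    if pct_id >= low:
        return "near_match"   # candidate sequence error, chimera, or a real related strain
    return "unexplained"      # candidate cross-talk or contaminant

# Made-up (OTU label, best-hit identity) pairs, for illustration only.
hits = [("Otu1", 100.0), ("Otu2", 98.4), ("Otu3", 96.0), ("Otu4", 87.2)]
summary = Counter(classify_otu(pct) for _, pct in hits)
print(dict(summary))
```

A "near_match" label is a prompt for manual review, not a verdict: as noted above, such an OTU may also be a chimera or the correct sequence of a closely related strain from another sample.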
If it seems likely that your control samples have an
unacceptable number of OTUs with sequence errors, then you can adjust the
pipeline by increasing the stringency of the quality filtering. There are
two ways to do this: reduce the expected error filtering threshold, and
increase the unique sequence abundance threshold so that sequences with
abundance two, or a higher cutoff, are discarded in addition to singletons.
With some trial and error, you can tune these parameters by (1) checking the
accuracy of the control OTUs as described above, and (2) measuring coverage.
If coverage drops too far, you may be filtering too stringently.
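
To make the two stringency settings concrete, here is a minimal Python sketch. It assumes the usual definition of expected errors as the sum of per-base error probabilities derived from Phred scores, P = 10^(-Q/10). The function names, the parameter names max_ee and min_size, and the toy reads are illustrative assumptions, not the interface of any particular tool.

```python
from collections import Counter

def expected_errors(quals):
    """Sum of per-base error probabilities, P = 10^(-Q/10), for Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)

def filter_reads(reads, max_ee=1.0, min_size=2):
    """Return {sequence: abundance} for unique sequences that survive both filters.

    reads: iterable of (sequence, list_of_phred_scores) pairs.
    max_ee: reads with more expected errors than this are discarded.
    min_size: unique sequences seen fewer times than this are discarded
              (min_size=2 discards singletons only).
    """
    passed = [seq for seq, quals in reads if expected_errors(quals) <= max_ee]
    counts = Counter(passed)
    return {seq: n for seq, n in counts.items() if n >= min_size}

# Toy reads, made up for illustration.
reads = [
    ("ACGT", [30, 30, 30, 30]),
    ("ACGT", [30, 30, 30, 30]),
    ("ACGA", [30, 30, 30, 30]),  # singleton after quality filtering
    ("ACGT", [3, 3, 3, 3]),      # low quality, fails the expected error filter
]
kept_default = filter_reads(reads)              # singletons discarded
kept_strict = filter_reads(reads, min_size=3)   # abundance two discarded as well
```

Lowering max_ee or raising min_size makes the filter more stringent. In this toy example the stricter setting keeps nothing at all, which is the kind of loss that the coverage check is meant to catch on real data.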