OTU QC: sequencing and PCR error

Almost certainly, some of your OTUs will be spurious because of sequence errors introduced by amplification (PCR substitution errors and chimeras) and by sequencing. Unfortunately, these are very difficult to identify, even in control samples, which is why I strongly recommend expected error filtering and discarding singletons.

The only reasonably reliable way to test for spurious OTUs due to sequence errors is to use a control sample with known sequences, i.e. a single strain or a mock community. See control samples for discussion.

If you have a control sample, then you will probably find that the reads for the controls have spurious OTUs due to cross-talk, so you should try to distinguish these from spurious OTUs due to sequence errors. You can do this by using the uncross command and manually reviewing the predictions, or by aligning OTUs to the reference sequences in your control sample and examining those which match, but not exactly, say in the range 95% to 99% identity. These OTUs are quite likely explained by sequence errors, but unfortunately other explanations are possible.

Typical mock communities contain strains which are common human pathogens, so if your "real" samples are from humans, they may contain closely related species (this often happens in data that I analyze in my own work, so this is not as paranoid as it might sound). In that case, an OTU with some differences compared with the reference sequence might be the correct sequence of a strain in the other samples. This can be tested to some extent by making OTUs from the control samples only, keeping in mind that this will not eliminate cross-talk. Also, it is impossible to distinguish chimeras from sequence errors when the number of differences is small. Bottom line: with some effort you can probably get a good sense of which OTUs in your mock samples have sequence errors, but you can't be sure.
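The alignment-based review described above can be sketched in Python. This is only an illustration: the function name classify_otus, the labels, and the input format (tuples of OTU label, reference label, percent identity taken from whatever aligner you use) are hypothetical. The sketch also assumes that anything at or above the lower bound but short of an exact match is worth reviewing, so a single difference in a short amplicon (above 99% identity) is included.

```python
def classify_otus(hits, low=95.0):
    """Label each OTU by its best hit to the control reference sequences.

    hits: iterable of (otu_id, ref_id, pct_identity) tuples (assumed format).
    low:  lower bound of the "matches, but not exactly" range, in percent.
    """
    # Keep only the highest-identity hit for each OTU.
    best = {}
    for otu, ref, pct in hits:
        if otu not in best or pct > best[otu][1]:
            best[otu] = (ref, pct)

    labels = {}
    for otu, (ref, pct) in best.items():
        if pct >= 100.0:
            labels[otu] = "exact"        # correct control sequence
        elif pct >= low:
            labels[otu] = "near_match"   # candidate sequence error; review by hand
        else:
            labels[otu] = "distant"      # more likely cross-talk or a contaminant
    return labels


hits = [
    ("Otu1", "refA", 100.0),
    ("Otu2", "refA", 98.4),
    ("Otu2", "refB", 91.0),
    ("Otu3", "refC", 88.0),
]
print(classify_otus(hits))
```

The "near_match" label is only a list of candidates for manual review. As noted above, such an OTU could also be a chimera or the correct sequence of a related strain from another sample.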

If it seems likely that your control samples have an unacceptable number of OTUs with sequence errors, then you can adjust the pipeline by increasing the stringency of the quality filtering. There are two ways to do this: reduce the expected error filtering threshold, and increase the unique sequence abundance threshold so that sequences with abundance two, or a higher cutoff, are discarded instead of just singletons. With some trial and error, you can tune these parameters by looking at (1) the accuracy of the control OTUs as described above, and (2) coverage. If coverage drops too far, you may be filtering too stringently.
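To make the two tuning knobs concrete, here is a minimal Python sketch of both filters applied to reads held in memory. It is not how the pipeline itself is implemented. It assumes the usual definition of expected errors as the sum of the per-base error probabilities 10^(-Q/10) implied by the Phred scores, and the names expected_errors, filter_reads, max_ee and min_size are hypothetical.

```python
from collections import Counter


def expected_errors(quals):
    """Expected number of errors in a read, given its Phred quality scores."""
    return sum(10 ** (-q / 10) for q in quals)


def filter_reads(reads, max_ee=1.0, min_size=2):
    """Apply expected error filtering, then a unique sequence abundance threshold.

    reads:    iterable of (sequence, phred_scores) pairs (assumed format).
    max_ee:   discard reads with more expected errors than this.
    min_size: discard unique sequences seen fewer than this many times;
              2 discards singletons, 3 also discards abundance two.
    Returns a dict mapping each surviving unique sequence to its abundance.
    """
    counts = Counter(
        seq for seq, quals in reads if expected_errors(quals) <= max_ee
    )
    return {seq: n for seq, n in counts.items() if n >= min_size}


reads = [
    ("ACGT", [40, 40, 40, 40]),
    ("ACGT", [40, 40, 40, 40]),
    ("ACGA", [40, 40, 40, 40]),   # singleton after dereplication
    ("TTTT", [2, 2, 2, 2]),       # roughly 2.5 expected errors, fails max_ee=1.0
    ("TTTT", [2, 2, 2, 2]),
]
print(filter_reads(reads))              # default: discard singletons
print(filter_reads(reads, min_size=3))  # stricter: also discard abundance two
```

Lowering max_ee or raising min_size makes the filtering more stringent. Re-running with different values and comparing the number of surviving reads is one rough way to see how quickly coverage falls.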