Quality control for OTU sequences

Amplicon reads often contain artifacts that my recommended pipeline does not filter, because they vary widely between datasets and it would be difficult to account for all of them in a single set of commands. It is generally easier to identify them by manually analyzing the OTU sequences rather than the reads, because the set of OTUs is much smaller. Of course, if you are going to run a pipeline repeatedly on reads obtained from similar libraries, it makes sense to modify the pipeline to filter the types of artifact you find.

Here, I describe the quality control checks that I use in my own work, with links to discussion and commands. If you encounter other artifacts in your data, please let me know and I will update this page.

See control samples for discussion of how to use controls to better understand your data.

Issue		Description
Alignments		Do the OTU sequences align well to a reference database for your gene?
Missing OTUs		Do all OTUs appear in the OTU table?
Coverage		How much of the data is explained by the OTUs?
Short constructs		Bad sequencing construct created by PCR
Strand duplicates		Sequences of both plus and minus strands
Offsets		Sequences start at different positions in the gene
Cross-talk		Reads assigned to the wrong sample
Sequence error		Polymerase errors and bad base calls
Low complexity		Sequencer noise
PhiX		Unfiltered spike-in
Chimeras		Unfiltered PCR chimeras
Mistargeting		Primers amplify a different region
Contaminants		Self-explanatory
Primers		Primer-binding sequences should be stripped at the start of the pipeline
Tight OTUs		OTUs >97% identical
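
As an illustration, two of the checks above (Missing OTUs and Coverage) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not part of the recommended pipeline: the function names are hypothetical, the OTU table is assumed to be tab-separated with one OTU per row and a header line starting with #, and any annotation after the first semicolon in a FASTA label (such as ;size=10;) is assumed not to be part of the OTU name.

```python
import csv
import io


def fasta_labels(fasta_text):
    """Return the set of OTU labels found in FASTA-formatted text.

    Assumption: the label is the first word after '>' and anything
    after the first ';' is an annotation rather than part of the name.
    """
    labels = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            labels.add(line[1:].split()[0].split(";")[0])
    return labels


def otu_table_counts(table_text):
    """Parse a tab-separated OTU table into {otu_label: total_reads}.

    Assumption: header/comment lines start with '#', the first column
    is the OTU label, and the remaining columns are integer counts.
    """
    counts = {}
    for row in csv.reader(io.StringIO(table_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        counts[row[0]] = sum(int(x) for x in row[1:])
    return counts


def missing_otus(fasta_text, table_text):
    """Missing OTUs check: labels in the FASTA but absent from the table."""
    return sorted(fasta_labels(fasta_text) - set(otu_table_counts(table_text)))


def coverage(table_text, total_reads):
    """Coverage check: fraction of all reads that were assigned to an OTU."""
    return sum(otu_table_counts(table_text).values()) / total_reads


# Small made-up example (not real data).
example_fasta = ">Otu1;size=10;\nACGT\n>Otu2\nGGCC\n>Otu3\nTTAA\n"
example_table = "#OTU ID\tS1\tS2\nOtu1\t5\t3\nOtu3\t0\t2\n"

print(missing_otus(example_fasta, example_table))
print(coverage(example_table, total_reads=20))
```

Here total_reads is the number of reads you tried to assign to OTUs; you have to supply it yourself, since it cannot be recovered from the table alone.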