Amplicon reads often contain artifacts that my recommended pipeline does not filter, because they vary widely between datasets and it would be difficult to account for all of them in a single set of commands. It is generally easier to identify them by manually analyzing the OTU sequences rather than the reads, because the set of OTUs is much smaller. Of course, if you are going to run a pipeline repeatedly on reads obtained from similar libraries, it makes sense to modify the pipeline to filter the types of artifact you find.
Here, I describe the quality control checks that I use in my own work, with links to discussion and commands. If you encounter other artifacts in your data, please let me know and I will update this page.
See control samples for discussion of how to use controls to better understand your data.
|Alignments||Do the OTU sequences align well to a reference database for your gene?|
|Missing OTUs||Do all OTUs appear in the OTU table?|
|Coverage||How much of the data is explained by the OTUs?|
|Short constructs||Bad sequencing constructs created by PCR|
|Strand duplicates||Sequences of both plus and minus strands|
|Offsets||Sequences start at different positions in the gene|
|Cross-talk||Reads assigned to the wrong sample|
|Sequence error||Polymerase errors and bad base calls|
|Low complexity||Sequencer noise|
|Chimeras||Unfiltered PCR chimeras|
|Mistargeting||Primers amplify a different region|
|Primers||Primer-binding sequences should be stripped at the start of the pipeline|
|Tight OTUs||OTUs >97% identical|
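
Two of the checks listed above (missing OTUs and strand duplicates) are simple enough to sketch in a few lines of Python. This is a rough illustration only, not the commands linked from this page: the function names are hypothetical, the OTU table is reduced to a plain list of labels, and strand duplicates are detected only when one OTU is the exact reverse complement of another, whereas real data would normally need an alignment that tolerates mismatches.

```python
from itertools import combinations

# Translation table for complementing nucleotides (assumes plain ACGT).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_fasta(text):
    """Return {label: sequence} from FASTA-formatted text."""
    seqs = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the label.
            label = line[1:].split()[0]
            seqs[label] = ""
        elif label is not None:
            seqs[label] += line.upper()
    return seqs


def reverse_complement(seq):
    """Reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def missing_otus(otu_seqs, table_labels):
    """OTU labels present in the FASTA but absent from the OTU table."""
    return sorted(set(otu_seqs) - set(table_labels))


def strand_duplicates(otu_seqs):
    """Pairs of OTUs whose sequences are exact reverse complements.

    Simplifying assumption: exact matches only, no alignment.
    """
    pairs = []
    for (a, seq_a), (b, seq_b) in combinations(sorted(otu_seqs.items()), 2):
        if seq_a == reverse_complement(seq_b):
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Toy data, invented for illustration.
    fasta = ">Otu1\nACGGTTAC\n>Otu2\nGTAACCGT\n>Otu3\nTTTTGGGG\n"
    otus = parse_fasta(fasta)
    print(missing_otus(otus, ["Otu1", "Otu2"]))
    print(strand_duplicates(otus))
```

Working on the OTU sequences keeps the all-against-all comparison cheap, which is the same reason given above for inspecting OTUs rather than reads.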