Read quality filtering

USEARCH v11

See also
  Quality scores
  Expected errors
  Average Q is a bad idea!
  Global trimming
  Choosing FASTQ filter parameters
 
Raw reads from a next-generation sequencing machine such as 454 or Illumina come with a predicted error probability for each base, encoded as a quality (Q) score. In many applications it is important to filter reads to reduce the number of errors. This matters especially in marker gene sequencing experiments such as 16S or ITS, where it is very challenging to distinguish true biological sequences and between-sample variation from sequencing errors and PCR artifacts (chimeras and point mutations introduced during amplification).
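
As a rough illustration of how a Q score relates to an error probability, the sketch below applies the standard Phred relation P = 10^(-Q/10). It assumes Phred+33 ASCII encoding of the FASTQ quality string, and the function names are invented for this example; they are not part of USEARCH.

```python
from typing import List


def q_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer Q scores.

    Assumes Phred+33 ASCII encoding; some older files use an offset of 64.
    """
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    # 'I', '5' and '+' decode to Q40, Q20 and Q10 under Phred+33.
    for q in decode_quality("I5+"):
        print(q, q_to_error_prob(q))
```

Under this relation, Q20 corresponds to a 1% chance that the base is wrong and Q40 to a 0.01% chance.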

In USEARCH, quality filtering is done with the fastq_filter command. I strongly recommend using expected error filtering.
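
To give a feel for what expected error filtering does, here is a minimal Python sketch. It is not the USEARCH implementation; it only illustrates the idea of summing the per-base error probabilities of a read and discarding the read if the sum exceeds a threshold. Phred+33 encoding, the function names and the default threshold of 1.0 are assumptions made for this example.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (label, sequence, quality string)


def expected_errors(qual: str, offset: int = 33) -> float:
    """Sum the per-base error probabilities P = 10^(-Q/10) over a read.

    Assumes Phred+33 ASCII encoding of the quality string.
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in qual)


def filter_reads(records: Iterable[Record], max_ee: float = 1.0) -> Iterator[Record]:
    """Yield only the records whose expected number of errors is <= max_ee."""
    for label, seq, qual in records:
        if expected_errors(qual) <= max_ee:
            yield label, seq, qual


if __name__ == "__main__":
    reads = [
        ("good", "ACGT", "IIII"),  # Q40 at every base, very few expected errors
        ("bad", "ACGT", "####"),   # Q2 at every base, more than one expected error
    ]
    for label, seq, qual in filter_reads(reads, max_ee=1.0):
        print(label, seq, round(expected_errors(qual), 4))
```

In this toy input only the read labelled "good" passes, because the four Q2 bases of the other read sum to more than one expected error.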

You can use fastx_learn to estimate the error rate after filtering.

There is an important difference between Q scores in pyrosequencing reads from 454 and Q scores in Illumina reads. In effect, 454 ignores the possibility of substitution errors and Illumina ignores indels. With 454, the Q score reflects the estimated probability that the length of the current homopolymer is wrong; with Illumina, it reflects the probability that the base call is wrong. For Illumina this is reasonable, because indel errors are very rare. But with 454, substitution errors are quite common, occurring at a frequency comparable to homopolymer errors. This means that 454 Q scores are less predictive of read errors than Illumina Q scores.