Read quality filtering
Average Q is a bad idea!
Choosing FASTQ filter parameters
Raw reads generated by a next-generation sequencing machine such as 454 or
Illumina carry a predicted error probability for each base, encoded as a
quality (Q) score. In many applications it is important to filter reads to
reduce the number of errors. This matters especially in marker gene sequencing
experiments such as 16S or ITS, where it is very challenging to distinguish true
biological sequences and between-sample variation from sequencing errors and PCR
artifacts (chimeras and point mutations introduced during amplification).
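
To make the link between Q scores and error probabilities concrete, here is a minimal sketch of the standard Phred relationship P = 10^(-Q/10). It assumes the quality string uses the common ASCII offset of 33 (older files may use 64). The function names are illustrative and are not part of USEARCH.

```python
def decode_quality(qual: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to integer Q scores.

    Assumes Phred+33 encoding by default; pass offset=64 for old Illumina files.
    """
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q: int) -> float:
    """Predicted probability that a base call with score q is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Hypothetical three-base quality string: Q40, Q20, Q10.
    for q in decode_quality("I5+"):
        print(q, phred_to_error_prob(q))
```

Each step of 10 in Q corresponds to a tenfold change in the predicted error probability, so Q20 means 1 error in 100 and Q40 means 1 in 10,000.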
In USEARCH, quality filtering is done with the fastq_filter command. I strongly
recommend using expected error filtering. You can use fastx_learn to estimate
the error rate after filtering.
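
As a rough illustration of what expected error filtering computes, the sketch below sums the per-base error probabilities of a read to get its expected number of errors, then keeps the read only if that sum is at or below a threshold. It is not the USEARCH implementation: the function names, the Phred+33 offset, and the default threshold of 1.0 are assumptions chosen for the example.

```python
def expected_errors(qual: str, offset: int = 33) -> float:
    """Expected number of errors in a read: the sum of per-base error probabilities.

    Assumes Phred+33 quality encoding unless another offset is given.
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in qual)


def passes_filter(qual: str, max_ee: float = 1.0, offset: int = 33) -> bool:
    """Keep a read if its expected errors do not exceed max_ee (illustrative threshold)."""
    return expected_errors(qual, offset) <= max_ee


if __name__ == "__main__":
    good = "I" * 100   # 100 bases at Q40
    poor = "+" * 20    # 20 bases at Q10
    print(expected_errors(good), passes_filter(good))
    print(expected_errors(poor), passes_filter(poor))
```

Because the sum is over the whole read, longer reads accumulate more expected errors at the same per-base quality, so a suitable threshold depends on read length.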
There is an important difference between Q scores in pyrosequencing reads from
454 and Q scores in Illumina reads. In effect, 454 ignores the possibility of
substitution errors and Illumina ignores indels. With 454, the Q score encodes
the estimated probability that the length of the current homopolymer is wrong,
and with Illumina the Q score encodes the probability that the base call is
wrong. In the case of Illumina, ignoring indels is reasonable because indel
errors are very rare. But with 454, substitution errors are quite common,
occurring with comparable frequency to homopolymer errors. This means that 454
Q scores are not as predictive of read errors as Illumina Q scores.