Expected errors predicted by Phred (Q) scores
Average Q is a bad idea!
Setting an e.e. threshold for long reads
Edgar & Flyvbjerg 2014 paper on expected errors and
Next-generation reads specify a Phred score
for each base, also known as a Quality or Q score. The Q score is an integer,
typically in the range 2 to 40. Q indicates the probability that the base call
is incorrect (P_e). For example, Q=2 means that the error probability is 63%, so
the machine is reporting that the base is more likely to be wrong than right,
while Q=20 corresponds to an error probability of 1%.
Expected number of errors
Assume that the Phred scores are good, in other words assume that the
machine makes good estimates of the error probabilities. Now we can measure the
overall quality of a read by calculating the expected number of errors in
the read. The expected
value of a quantity under a probability distribution is a technical term in
statistics, so let me explain what that means in this case.
Take the error probabilities according to the Q scores in a given read, and
imagine a very large number of reads containing errors that occur with those
probabilities. For example, if the first base in the read has Q=20, then 1% of
the reads in our collection will have errors in that position. Some of the reads
will have no errors, some will have one error, and so on. The Q scores can't
tell us exactly how many errors there are in a given read. However, the Q scores
can tell us what the average number of errors would be in that large
collection, and this is the definition of the expected number of errors. Because
we're taking an average, the value is real rather than integer, and can be less
Calculating the expected number of errors and most probable number of errors
It turns out that it is easy to calculate the expected number of errors: it
is the sum of the error probabilities. The most probable number of errors (E*) is
also easy to calculate. First calculate E = expected errors = sum P_e. Then
round down to the nearest integer, and this is the most probable number of
errors. In symbols: E* = floor(E). For
example, if E < 1, then the most probable number of errors is zero. Proofs
of these results are given in
Edgar & Flybjerg (2014).
Quality filtering by expected errors
Small E means high quality, large E means low quality. We can therefore
filter low quality reads by setting a maximum number of expected errors (E_max)
and discarding reads with E > E_max. This is done by running the
Choosing the maximum expected error threshold
A natural choice is E_max = 1, because the most probable number of errors of
a filtered read is zero. Though of course, we expect some of them to have one or more
errors (this can't be entirely avoided, however stringent the filter, because Q
scores are probabilities, not certainties). If you want to filter more
stringently, you might choose something like E_max = 0.5 or E_max = 0.25.
Overlapping paired reads
If you have paired reads that overlap, then these can be exploited to
improve the error filtering. The paired read merger (assembler) should report
re-calculated Phred scores in the overlapping region to reflect that we have two
independent observations of the same bases. If the forward and reverse read
agree on a base, then our confidence that the base is correct should increase.
The Q score should be higher, corresponding to a lower error probability.
Conversely, if the base calls disagree, then the re-calculated Phred score
should be lower than both the original base calls. The Q score in the merged
read (contig) should be calculated as the
probability according to
Bayes theorem where
the P_e's from the forward and reverse read are the
Quality filtering by maximum expected errors should be performed as a
post-processing step after reads have been merged in order to take advantage of
the re-calculated Phred scores, which should be better predictions of the true
It is important to use the USEARCH
fastq_mergepairs command for read merging because according to my tests,
other read mergers do not correctly calculate the posterior Q scores, including
PANDAseq, COPE, PEAR, fastq-join and FLASH. Some of these mergers also have high
rates of false positive and incorrect merges, including especially PANDAseq and
the make.contigs() command in mothur.
OK, the concept is nice, but how well does this work in practice?
For Illumina reads, this approach works very well, because the Illumina Q
scores are pretty good (much better than 454). The accuracy
of the Q scores varies with the sequencing machine, base-calling software and
probably other factors such as reagents. Amplicon sequencing tends to have less
accurate Q scores than shotgun, so you can't assume that results on a PhiX
library will be similar to results for amplicon reads. The Q scores may systematically under-
or over-estimate the error probabilities in different runs. Errors do
have some tendency to correlate, i.e. to cluster in certain reads, so they're
not perfectly independent as assumed by the theory. Also, Q scores can only
predict sequencer error per se, they cannot predict errors introduced before
sequencing, e.g. during PCR. But despite these caveats, the approach works well
in practice, better than anything else I've tested.
See this FAQ: Should I increase the expected error
threshold for long reads?