Average Q is a bad idea!
The quality (Q) score of a base indicates the sequencing
machine's estimated probability that the base call is wrong. Consider a read
with a given set of Q scores, and suppose the machine's Q scores are accurate.
Now consider a large sample of reads with the same Q scores. The expected
number of errors is the average number of errors per read that we would find in
that sample, which equals the sum of the error probabilities of the bases in
the read. This is roughly equivalent to the most likely number of errors in the
read, though the expected number of errors is not always an integer and can be
less than one.
Take a simple example: a read of length two with
quality scores Q3 and Q40, corresponding to error probabilities P=0.5 and
P=0.0001. The base with Q3 is much more likely to have an error than the base
with Q40 (0.5/0.0001 = 5,000 times more likely), so we can ignore the Q40 base
to a good approximation. Consider a large sample of reads with (Q3, Q40):
approximately half of them will have an error (because of the P=0.5 from the Q3
base). We express this by saying that the expected number of errors in a read
with quality scores (Q3, Q40) is 0.5.
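The arithmetic in this example can be checked with a short Python sketch. It
assumes the standard Phred relation P = 10^(-Q/10), which is consistent with
the values quoted above (Q40 gives P=0.0001), and it sums the per-base error
probabilities to get the expected number of errors. The function names are
illustrative only and are not taken from any particular tool.

```python
def error_probability(q):
    """Convert a Phred quality score to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def expected_errors(q_scores):
    """Expected number of errors in a read: sum of per-base error probabilities."""
    return sum(error_probability(q) for q in q_scores)


read = [3, 40]  # the (Q3, Q40) read from the example

print(round(error_probability(3), 3))   # Q3 is close to P = 0.5
print(round(error_probability(40), 4))  # Q40 is P = 0.0001
print(round(expected_errors(read), 3))  # close to 0.5, dominated by the Q3 base
```

The Q40 base contributes only 0.0001 to the total, which is why it can be
ignored to a good approximation.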
As this example shows, low Q scores (high error
probabilities) dominate expected errors, but this information is lost by
averaging if low Qs appear in a read with mostly high Q scores. This explains
why expected errors is a much better indicator of read accuracy than average Q.
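To make the loss of information concrete, the sketch below compares average Q
and expected errors for two made-up reads of length 100. The reads are
hypothetical and chosen only for illustration, the Phred relation
P = 10^(-Q/10) is assumed, and the function names are not from any particular
tool.

```python
def expected_errors(q_scores):
    """Sum of per-base error probabilities, assuming P = 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in q_scores)


def average_q(q_scores):
    """Arithmetic mean of the Q scores in a read."""
    return sum(q_scores) / len(q_scores)


# Two made-up reads of length 100 (illustrative values, not real data).
read_a = [40] * 95 + [3] * 5   # mostly Q40, with five Q3 bases
read_b = [38] * 100            # uniformly Q38

for name, read in (("A", read_a), ("B", read_b)):
    print(name, round(average_q(read), 2), round(expected_errors(read), 3))
```

Both reads have an average Q of about 38, but read A has roughly 2.5 expected
errors while read B has roughly 0.016. A filter based on average Q would treat
them as nearly identical.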