Keep as many reads as possible.
This is achieved by specifying a short length for
truncation (fastq_trunclen) and a low quality threshold (fastq_truncqual) or
high expected error threshold (fastq_maxee). However, shorter lengths reduce
phylogenetic discrimination, and lower quality
/ higher expected errors will tend to
increase the error rate and can lead to problems such as spurious OTUs.
Keep as many read positions as possible.
||More bases means
better phylogenetic discrimination. However, quality tends to drop towards the
end of an unpaired read, so keeping more bases may increase the number of
errors. With paired reads that overlap, this usually isn't a problem if you
merge the reads using fastq_mergpairs before
Reduce the number of read errors.
reduced by (1) truncating the read at a shorter length (if unpaired), and/or (2)
by having a more stringent quality threshold (higher Q or lower
expected errors). However, these will reduce
the total number of bases available for downstream analysis, which may reduce
sensitivity as explained above.
Filter by minimum quality score or
maximum expected errors?
In the UPARSE paper, I used a minimum quality score (fastq_filter
command with the fastq_truncqual option) to show that a single pipeline with a single set of
parameters gave high accuracy with enormously different input data, from two
million Illumina paired and unpaired reads (forward or reverse) to a few
thousand 454 reads per sample. Given the wide range of input data used in
the UPARSE paper, I could not find a single value of
‑fastq_maxee that worked well with all sets of
reads, and I did not want to give the appearance of tuning (or over-tuning)
to the mock community datasets. However, in practice I would usually
recommend using an expected error filter
as I have found this to be a better predictor of read error rates in most cases.
Paired reads with overlap
If you have paired reads with an overlap that is long enough to merge them
using fastq_mergepairs, then it is usually
best to merge first, then quality filter with
fastq_filter using a maximum expected errors threshold with no length
Paired reads with no overlap
If there is no overlap, or the overlap is too short to obtain reliable
alignments, then it is usually best to discard the reverse reads. Typically the
reverse reads have lower quality, and I do not believe the information in the
reverse read can be used effectively (assuming these are amplicon reads). So I
recommend just using the forward reads, i.e. treating them as unpaired (see
Unpaired reads that cover the full
If you have unpaired reads that cover the full length of the amplicons, i.e.
include both primers, then you have two choices: you can truncate at the second
primer (which usually means keeping the full-length reads, unless they extend
further into a non-biological sequence such as an adapter), or you can choose a
shorter length. It may be better to choose a shorter length if the read quality
deteriorates towards the end.
that cover partial amplicons
If you have unpaired amplicon reads that do not extend to the second primer,
or have low quality towards the end of the read, then you should truncate to a
fixed length before performing OTU clustering (see
global trimming for an explanation).
The simplest way to choose parameters is by trial and error. Set the maximum
expected error parameter (fastq_maxee) to a few reasonable values (say, 0.25,
0.5 and 1), and similarly for the truncation length (say, 50%, 75%, 100% and
125% of the median read length), and measure how many reads are passed by the
filter. The number of reads in a FASTQ file can be determined by using
wc -l filename
to determine the number of lines in the file; divide by 4 to get the number
of reads. The report generated by fastq_stats can
be helpful in guessing suitable parameters.