Read quality filtering
FASTQ format options
If possible, parameters for the fastq_filter command should be chosen manually for each sequencing run by examining the distribution of read length and Phred scores by position in the read, as these characteristics can vary considerably and can have a large impact on downstream analysis. The report given by fastq_eestats2 can be a useful starting point for this exercise. Parameters are chosen to balance conflicting objectives summarized in the table below.
|Keep as many reads as possible.||This is achieved by specifying a short length for truncation (fastq_trunclen) and a low quality threshold (fastq_truncqual) or high expected error threshold (fastq_maxee). However, shorter lengths reduce phylogenetic discrimination, and lower quality / higher expected errors will tend to increase the error rate and can lead to problems such as spurious OTUs.
|Keep as many read positions as possible.||More bases means better phylogenetic discrimination. However, quality tends to drop towards the end of an unpaired read, so keeping more bases may increase the number of errors. With paired reads that overlap, this usually isn't a problem if you merge the reads using fastq_mergpairs before filtering.
|Reduce the number of read errors.||Errors are reduced by (1) truncating the read at a shorter length (if unpaired), and/or (2) by having a more stringent quality threshold (higher Q or lower expected errors). However, these will reduce the total number of bases available for downstream analysis, which may reduce sensitivity as explained above.
Filter by minimum or average quality score or maximum expected errors?
In the UPARSE paper, I used a minimum quality score (fastq_filter command with the fastq_truncqual option) to show that a single pipeline with a single set of parameters gave high accuracy with enormously different input data, from two million Illumina paired and unpaired reads (forward or reverse) to a few thousand 454 reads per sample. Given the wide range of input data used in the UPARSE paper, I could not find a single value of -fastq_maxee that worked well with all sets of reads, and I did not want to give the appearance of tuning (or over-tuning) to the mock community datasets. However, in practice I would usually recommend using an expected error filter as I have found this to be a better predictor of read error rates in most cases.
Paired reads with overlap
If you have paired reads with an overlap that is long enough to merge them using fastq_mergepairs, then it is usually best to merge first, then quality filter with fastq_filter using a maximum expected errors threshold with no length truncation.
Paired reads with no overlap
If there is no overlap, or the overlap is too short to obtain reliable alignments, then it is usually best to discard the reverse reads. Typically the reverse reads have lower quality, and I do not believe the information in the reverse read can be used effectively (assuming these are amplicon reads). So I recommend just using the forward reads, i.e. treating them as unpaired (see next).
Unpaired reads that cover the full amplicon length
If you have unpaired reads that cover the full length of the amplicons, i.e. include both primers, then you have two choices: you can truncate at the second primer (which usually means keeping the full-length reads, unless they extend further into a non-biological sequence such as an adapter), or you can choose a shorter length. It may be better to choose a shorter length if the read quality deteriorates towards the end.
Unpaired reads that cover partial amplicons
If you have unpaired amplicon reads that do not extend to the second primer, or have low quality towards the end of the read, then you should truncate to a fixed length before performing OTU clustering (see global trimming for an explanation).
The simplest way to choose parameters is by trial and error. Set the maximum expected error parameter (fastq_maxee) to a few reasonable values (say, 0.25, 0.5 and 1), and similarly for the truncation length (say, 50%, 75%, 100% and 125% of the median read length), and measure how many reads are passed by the filter. The number of reads in a FASTQ file can be determined by using wc -l filename to determine the number of lines in the file; divide by 4 to get the number of reads. The report generated by fastq_stats can be helpful in guessing suitable parameters.
Example: Analysis of 454 reads
The figure below shows an analysis of 454 reads from the UPARSE paper (Supplementary Note 3). The average number of expected errors over all reads if truncated at each position was calculated from the Q scores (red line, right-hand y axis) using the fastq_stats command. Four different quality filters are considered with different quality score thresholds (Qmin, corresponding to one less than the -fastq_truncqual option of fastq_filter). The fraction of reads passing each filter if truncated each position is shown (left-hand y axis). The truncation length L=250 and Qmin=16 values (black dot) were selected as a compromise between stringent quality filtering to suppress errors (high Q), keeping as many reads as possible to increase sensitivity to low-abundance sequences (small L and low Q), keeping as many positions as possible to increase phylogenetic discrimination (large L), truncating in order to discard lower-quality regions towards the end of the reads (small L).