FASTQ reads can be filtered to discard low-quality reads, as predicted by their
Phred scores. USEARCH provides a maximum expected error filter, which uses a
better measure of base call accuracy than average Q or minimum Q score filters.
Quality filtering is implemented in the fastq_filter command, which offers a
rich set of options.
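To illustrate why expected errors can be a better measure than average Q, here is a minimal Python sketch. It uses the standard Phred relation P = 10^(-Q/10) and sums the per-base error probabilities to get the expected number of errors in a read. The function names, the `max_ee` threshold, and the Phred+33 offset are assumptions made for this example; this is not USEARCH's own code.

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def expected_errors(quals: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities for one quality string."""
    return sum(phred_to_error_prob(ord(c) - offset) for c in quals)


def mean_q(quals: str, offset: int = 33) -> float:
    """Arithmetic mean of the Q scores, shown only for comparison."""
    return sum(ord(c) - offset for c in quals) / len(quals)


def passes_maxee(quals: str, max_ee: float = 1.0, offset: int = 33) -> bool:
    """Keep a read if its expected error count does not exceed max_ee.

    max_ee is a hypothetical threshold name chosen for this sketch.
    """
    return expected_errors(quals, offset) <= max_ee


# Eight Q40 bases ('I') followed by two Q2 bases ('#'), Phred+33 assumed.
read_quals = "IIIIIIII##"
print(mean_q(read_quals))           # the average Q still looks high
print(expected_errors(read_quals))  # but more than one error is expected
print(passes_maxee(read_quals))
```

In this example the average Q is above 30, which an average-Q filter would typically accept, yet the two low-quality bases alone push the expected error count above 1.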
FASTQ to FASTA conversion
The fastq_filter command can generate
output in FASTQ and/or FASTA format. If no quality filtering parameters are
specified, it performs a "raw" conversion of FASTQ to FASTA.
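As a rough picture of what a "raw" conversion involves, the following Python sketch copies each label and sequence to FASTA and drops the quality line. It assumes the common four-lines-per-record FASTQ layout, and the function names are hypothetical. It is meant only to illustrate the idea, not to reproduce the fastq_filter command.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (label, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        yield header[1:].rstrip("\r\n"), seq, qual


def fastq_to_fasta(src: TextIO, dst: TextIO) -> int:
    """Write every record as FASTA, discarding qualities; return the record count."""
    n = 0
    for label, seq, _qual in read_fastq(src):
        dst.write(f">{label}\n{seq}\n")
        n += 1
    return n


# Usage with in-memory data; real files opened in text mode work the same way.
fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n##\n")
fasta = io.StringIO()
fastq_to_fasta(fastq, fasta)
print(fasta.getvalue(), end="")
```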
Paired read overlapping
Paired reads that overlap can be "merged" or "assembled" by aligning the forward
and reverse reads to give a single FASTQ or FASTA record for each pair. This is
implemented in the fastq_mergepairs
command. Phred (quality, Q) scores for the merged pair are calculated using
Bayesian statistics and are reported in the merged FASTQ record. If the forward
and reverse reads agree on a base call, the Q score is increased; if they
disagree, the Q score is reduced.
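The effect of agreement and disagreement on the merged Q score can be sketched with a simple Bayesian model. The model below assumes that the four bases are equally likely a priori, that an erroneous call is equally likely to be any of the three other bases, and that errors in the two reads are independent. Under those assumptions the posterior error probabilities have the closed forms used in the code. The function names are hypothetical, and the exact calculation performed by fastq_mergepairs may differ.

```python
import math
from typing import Tuple


def q_to_p(q: float) -> float:
    """Phred score to error probability."""
    return 10 ** (-q / 10)


def p_to_q(p: float) -> float:
    """Error probability to Phred score."""
    return -10 * math.log10(p)


def merged_error_agree(p1: float, p2: float) -> float:
    """Posterior error probability when both reads call the same base."""
    return (p1 * p2 / 3) / (1 - p1 - p2 + 4 * p1 * p2 / 3)


def merged_error_disagree(p_kept: float, p_other: float) -> float:
    """Posterior error probability of the kept base when the calls differ."""
    return p_kept * (1 - p_other / 3) / (p_kept + p_other - 4 * p_kept * p_other / 3)


def merge_base(b1: str, q1: float, b2: str, q2: float) -> Tuple[str, float]:
    """Merge one aligned position; on disagreement keep the higher-Q call."""
    p1, p2 = q_to_p(q1), q_to_p(q2)
    if b1 == b2:
        return b1, p_to_q(merged_error_agree(p1, p2))
    if q1 >= q2:
        return b1, p_to_q(merged_error_disagree(p1, p2))
    return b2, p_to_q(merged_error_disagree(p2, p1))


print(merge_base("A", 20, "A", 20))  # agreement: merged Q rises above 20
print(merge_base("A", 30, "C", 20))  # disagreement: merged Q falls below 30
```

Under this model, two agreeing Q20 calls merge to about Q45, while a Q30 call contradicted by a Q20 call drops to about Q10.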
Dereplication
Dereplication removes identical
sequences, leaving one copy of each unique sequence. With very large read
depths, this can significantly reduce the data size and cost of downstream
processing, especially with amplicon reads. Dereplication is implemented in the
Chimeric sequence filtering
Amplicon reads contain chimeric sequences, which are PCR artifacts. The UCHIME
algorithm is a high-throughput, high-accuracy chimera filter. UCHIME is
implemented in the uchime_ref and uchime_denovo commands. For OTU clustering,
the cluster_otus command includes a highly sensitive chimera filter based on the
UPARSE-REF algorithm, which has better sensitivity than UCHIME.
FASTQ file statistics
The fastq_stats command generates
statistics on read quality and length. The
fastq_chars command generates statistics on the ASCII characters used to
represent Q scores, which can be helpful when trying to determine the
format of a FASTQ file.
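The idea of using the ASCII characters to determine the format can be sketched in Python as follows. The code tallies the characters on the quality lines and applies a common heuristic: characters below ASCII 59 cannot occur in the 64-based encodings, so their presence implies Phred+33. The function names and cut-off values are assumptions for this example, the heuristic can be wrong for unusual data, and the code is not the fastq_chars implementation.

```python
import io
from collections import Counter
from typing import Optional, TextIO


def quality_char_counts(handle: TextIO) -> Counter:
    """Count characters on quality lines, assuming four lines per record."""
    counts: Counter = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 3:  # the fourth line of each record holds the qualities
            counts.update(line.rstrip("\r\n"))
    return counts


def guess_offset(counts: Counter) -> Optional[int]:
    """Heuristic guess of the ASCII offset: 33, 64, or None if ambiguous."""
    if not counts:
        return None
    lo = min(ord(c) for c in counts)
    hi = max(ord(c) for c in counts)
    if lo < 59:
        return 33  # too low for any 64-based encoding
    if hi > 74:
        return 64  # no low characters and some high ones: probably 64-based
    return None    # the observed range is compatible with both


counts = quality_char_counts(io.StringIO("@r1\nACGT\n+\n#5II\n"))
for ch, n in sorted(counts.items()):
    print(ch, ord(ch), n)
print(guess_offset(counts))
```

Returning None for an ambiguous range is a deliberate choice in this sketch: a file that contains only mid-range characters cannot be classified reliably from its characters alone.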