Read quality filtering
Choosing FASTQ filter parameters
Global trimming for ITS
Global trimming addresses a problem that occurs in next-generation
The issue is related to terminal gaps in cluster alignments. I'll use 16S as an
example since this is the most common application where it arises, but similar
considerations apply for most amplicon sequencing applications. Recommendations are summarized in
the table, with explanation below.
Reads should be globally alignable with NO terminal gaps
To get the best results from clustering, amplicon reads should be globally
alignable, with no terminal gaps in pairwise alignments of closely related
sequences. In particular, the
cluster_otus command considers all gaps to be differences, including
terminal gaps. It is therefore critically important to trim reads appropriately.
This is true even if the amplicons have large variations in length due to the
biology of the gene or region, e.g. in the case of ITS (see
global trimming for ITS amplicon reads).
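
To see why terminal gaps matter, here is a minimal sketch in Python. It is not the code used by cluster_otus; it only illustrates the rule stated above, that every gap column counts as a difference. The function name, the identity definition (matches divided by alignment columns) and the sequences are made up for this example.

```python
def identity_with_terminal_gaps(row_a: str, row_b: str) -> float:
    """Percent identity of two pre-aligned rows of equal length.

    Every column containing a gap ('-'), including terminal gaps,
    is counted as a difference. This definition is an assumption
    chosen to illustrate the text, not the tool's exact formula.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    if not row_a:
        raise ValueError("aligned rows must not be empty")
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return 100.0 * matches / len(row_a)


# Two reads of the same sequence; the second is 4 bases shorter,
# so its row ends in terminal gaps.
full = "ACGTACGTACGTACGTACGT"
shorter = "ACGTACGTACGTACGT----"

print(identity_with_terminal_gaps(full, shorter))            # 80.0
# After trimming both reads to the same 16 bases they are identical.
print(identity_with_terminal_gaps(full[:16], shorter[:16]))  # 100.0
```

The two reads have no real differences, yet the untrimmed pair scores 80% because of the four terminal gap columns.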
Assessing read quality
See choosing FASTQ filter parameters
for discussion of how to determine the appropriate parameters for global
trimming for your reads.
Full coverage (reads include forward and reverse primers)
The best approach here depends on read quality. If the
read quality is good enough, you can keep full-length amplicons, trimming reads
to the start or end of the second primer (trimming is only needed if there are
additional bases beyond the primer, e.g. adapter sequence). If the read quality
is too low towards the end of the read, then you can trim to a fixed length as
for partial coverage reads.
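
Trimming to the end of the second primer can be sketched as follows. This is a simplified illustration, assuming the second primer appears in the read as an exact match; real primers often contain degenerate bases and reads contain errors, so a production tool would allow mismatches. The function name and all sequences are hypothetical.

```python
from typing import Optional


def trim_after_second_primer(read: str, second_primer: str) -> Optional[str]:
    """Truncate a read at the end of the second primer.

    second_primer must be given as it appears in the read (i.e. already
    reverse-complemented if needed). Only exact matches are found, which
    is a simplification. Returns None if the primer is not present.
    """
    pos = read.rfind(second_primer)  # last occurrence, nearest the 3' end
    if pos == -1:
        return None
    return read[: pos + len(second_primer)]


# Hypothetical read: first primer + amplicon + second primer + adapter.
first_primer = "GTGCCAGC"
amplicon = "TTTTAAAACCCC"
second_primer = "GGACTACA"
adapter = "AGATCGGA"
read = first_primer + amplicon + second_primer + adapter

print(trim_after_second_primer(read, second_primer))
# GTGCCAGCTTTTAAAACCCCGGACTACA
```

To trim to the start of the second primer instead, return `read[:pos]`.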
Partial coverage (read length shorter than amplicon length)
Trim reads to a fixed
length. You can use fastx_truncate or
Typical 16S reads are derived from amplicon sequences.
Amplicons are obtained by PCR using a pair of primers. It is important to
consider whether the reads cover full or partial amplicons. Full coverage is
typically obtained from overlapping paired reads.
In both cases, read lengths vary. With
full coverage reads, lengths vary primarily because amplicon lengths vary (due
to hypervariable regions in the gene). Minor variations can also occur due to
indel errors in the reads (common in pyrosequencing reads due to homopolymers,
but very rare with Illumina). With partial coverage reads, lengths vary
primarily due to quality trimming. Since quality tends to fall towards the end
of the reads, the last bases tend to be less reliable. This can produce an
alignment with unreliable bases towards the end, as shown in the figure below.
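
Trimming partial coverage reads to a fixed length removes this length variation. As a minimal sketch, assuming reads are plain strings that all start at the same position in the amplicon: the function below truncates each read to the chosen length and discards reads that are shorter, so that every surviving read has the same length and no terminal gaps can arise. The function name and the example reads are hypothetical, and discarding short reads is a choice made for this sketch.

```python
from typing import Iterable, List


def truncate_fixed(reads: Iterable[str], length: int) -> List[str]:
    """Truncate reads to a fixed length, dropping reads that are too short."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [read[:length] for read in reads if len(read) >= length]


# Reads of one sequence whose lengths differ because of quality trimming.
reads = ["ACGTACGTAC", "ACGTACGT", "ACGTAC", "ACGTACGTACGT"]

print(truncate_fixed(reads, 8))
# ['ACGTACGT', 'ACGTACGT', 'ACGTACGT']
```

Choosing the length is a trade-off: a longer value keeps more informative bases per read but discards more reads, while a shorter value keeps more reads but less sequence.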