global trimming
See also
Read quality filtering

  Choosing FASTQ filter parameters
  Global trimming for ITS

Global trimming addresses a problem that occurs in next-generation amplicon sequencing. The issue is related to terminal gaps in cluster alignments. I'll use 16S as an example since this is the most common application where it arises, but similar considerations apply for most amplicon sequencing applications. Recommendations are summarized in the table, with explanation below.

Reads should be globally alignable with NO terminal gaps
To get the best results from clustering, amplicon reads should be globally alignable, with no terminal gaps in pairwise alignments of closely related sequences. In particular, the cluster_otus command counts all gaps as differences, including terminal gaps. It is therefore critically important to trim reads appropriately. This is true even if the amplicons vary widely in length due to the biology of the gene or region, e.g. in the case of ITS (see global trimming for ITS amplicon reads).
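To make the effect concrete, here is a minimal Python sketch that counts differences in a pairwise alignment with and without terminal gaps. It is an illustration only, not the cluster_otus implementation; the function name `count_differences` and the example sequences are made up for this sketch.

```python
def count_differences(row_a, row_b, count_terminal_gaps=True):
    """Count differing columns between two rows of a pairwise alignment.

    Gaps are written as '-'. Hypothetical helper for illustration only.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")

    start, end = 0, len(row_a)
    if not count_terminal_gaps:
        # Skip leading and trailing columns where either row has a gap.
        while start < end and "-" in (row_a[start], row_b[start]):
            start += 1
        while end > start and "-" in (row_a[end - 1], row_b[end - 1]):
            end -= 1

    return sum(1 for a, b in zip(row_a[start:end], row_b[start:end]) if a != b)


# Two reads of the same template; the second was cut 4 bases shorter.
full    = "ACGTACGTACGTACGT"
shorter = "ACGTACGTACGT----"

print(count_differences(full, shorter))                             # 4
print(count_differences(full, shorter, count_terminal_gaps=False))  # 0
```

The two reads agree at every base they share, yet counting terminal gaps makes them look four differences apart. Trimming both reads to the same end point removes those spurious differences.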

Assessing read quality
See choosing FASTQ filter parameters for a discussion of how to determine appropriate global trimming parameters for your reads.

Reads | Recommendation
Full coverage (reads include forward and reverse primers) | The best approach here depends on read quality. If the read quality is good enough, you can keep full-length amplicons, trimming reads to the start or end of the second primer (trimming is only needed if there are additional bases beyond the primer, e.g. adapter sequence). If the read quality is too low towards the end of the read, then you can trim to a fixed length as for partial coverage reads.
Partial coverage (read length shorter than amplicon length) | Trim reads to a fixed length. You can use fastx_truncate or fastq_filter for this.
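As an illustration of what trimming to a fixed length means, here is a minimal Python sketch. It is not how fastx_truncate or fastq_filter are implemented; the function name `truncate_reads` and the toy records are hypothetical. The sketch assumes that reads shorter than the target length are discarded, since keeping them would reintroduce terminal gaps.

```python
def truncate_reads(records, length):
    """Yield (label, seq, qual) tuples truncated to exactly `length` bases.

    Assumption: reads shorter than `length` are dropped rather than kept,
    so that every surviving read has the same length.
    """
    for label, seq, qual in records:
        if len(seq) < length:
            continue  # too short to be globally alignable at this length
        yield label, seq[:length], qual[:length]


# Toy records standing in for parsed FASTQ entries (hypothetical data).
reads = [
    ("read1", "ACGTACGTAC", "IIIIIIIIII"),
    ("read2", "ACGTAC", "IIIIII"),
    ("read3", "TTGCATGCAAGT", "IIIIIIIIIIII"),
]

trimmed = list(truncate_reads(reads, 8))
for label, seq, qual in trimmed:
    print(label, seq, qual)
```

With a target length of 8, read2 is dropped and the other two reads are cut to 8 bases, so all surviving reads end at the same position.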

Typical 16S reads are derived from amplicon sequences. Amplicons are obtained by PCR using a pair of primers. It is important to consider whether the reads cover full or partial amplicons. Full coverage is typically obtained from overlapping paired reads.
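One rough way to tell full from partial coverage, and to trim a full-coverage read at the second primer, is sketched below in Python. This is a simplified illustration under strong assumptions: the primer is matched exactly (no mismatches or degenerate bases), and the primer, adapter and insert sequences are invented for the example. The names `reverse_complement` and `trim_to_second_primer` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def trim_to_second_primer(read, reverse_primer, keep_primer=True):
    """Return (trimmed_read, is_full_coverage).

    The reverse primer appears in the read as its reverse complement.
    Assumption: exact matching only; real primers often need mismatch
    tolerance and support for degenerate bases.
    """
    target = reverse_complement(reverse_primer)
    pos = read.rfind(target)
    if pos == -1:
        return read, False  # second primer not reached: partial coverage
    end = pos + len(target) if keep_primer else pos
    return read[:end], True


# Invented sequences for illustration only.
forward_primer = "GGAACC"
insert = "TTTTAAAACCCC"
reverse_primer = "GATTACA"
adapter = "CTGTCTC"

read = forward_primer + insert + reverse_complement(reverse_primer) + adapter

print(trim_to_second_primer(read, reverse_primer))
print(trim_to_second_primer(read, reverse_primer, keep_primer=False))
print(trim_to_second_primer(forward_primer + insert, reverse_primer))
```

The first call trims to the end of the second primer and drops the adapter, the second trims to the start of the primer, and the third reports partial coverage because the primer is never reached.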

In both cases, read lengths vary. With full coverage reads, lengths vary primarily because amplicon lengths vary (due to hypervariable regions in the gene). Minor variations can also occur due to indel errors in the reads (common in pyrosequencing reads due to homopolymers, but very rare with Illumina). With partial coverage reads, lengths vary primarily due to quality trimming. Since quality tends to fall towards the end of the reads, the last bases tend to be less reliable. This can produce an alignment with unreliable bases towards the end, as shown in the figure below.