global trimming

global trimming for ITS amplicon reads

ITS amplicons often have large variations in length due to the biology of the region -- some of the sequence evolves neutrally, and long indels are common. Even in this case, it is important to trim reads to a fixed length unless the reads are long enough to include the complete amplicon, i.e. all the bases between the amplification primers.
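As a minimal sketch of what fixed-length trimming means, the Python below truncates every read to the same length. The function name `trim_to_length` and the choice to discard reads shorter than the target length are assumptions made for illustration; this is not the USEARCH implementation.

```python
def trim_to_length(reads, length):
    """Truncate each read to `length` bases.

    Assumption for this sketch: reads shorter than `length` are dropped,
    so every surviving read has exactly the same length.
    """
    trimmed = []
    for seq in reads:
        if len(seq) >= length:
            trimmed.append(seq[:length])
    return trimmed


# Toy reads of varying length (made-up sequences, not real ITS data).
reads = ["ACGTACGTAC", "ACGTACG", "ACGTACGTTTGA"]
print(trim_to_length(reads, 8))
```

After trimming, all remaining reads cover the same positions, so they can be compared base by base.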

Biological differences due to indels are not lost by length trimming: sequences that differ by an indel will have a correspondingly reduced sequence identity. The ITS region has a high evolutionary rate, so sequences that differ by indels will also have many substitutions. In practice, it is therefore not necessary to identify gaps correctly in order to get a good estimate of identity. Generally, different species have ITS regions with identities well below 97%, which makes clustering easier than for 16S, where biological differences are hard to distinguish from experimental error.

Length trimming avoids complications that would otherwise arise when attempting to compare reads that vary in length. These complications arise at three stages in the UPARSE pipeline: 1. dereplication, 2. OTU clustering and 3. mapping reads to OTUs.

Complications arise when two reads are identical or very similar over the length of the shorter read. The additional letters in the longer sequence make the identity of the sequence pair ambiguous (e.g., are they identical for the purposes of dereplication?), and also make it difficult to compare the identity of both sequences to the centroid sequence of an OTU. These problems are more serious than you would naively expect, and arise even when OTUs are constructed using a reference database rather than de novo clustering.
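The ambiguity can be made concrete with a small sketch. The hypothetical helper `identity_pair` compares two reads aligned at their first base with no gaps (a simplifying assumption, not how USEARCH aligns sequences) and reports identity computed two ways: over the overlapping positions only, and over the full length of the longer read.

```python
def identity_pair(a, b):
    """Return (identity over the overlap, identity over the longer read).

    Assumption for this sketch: the reads start at the same position and
    are compared without gaps, so only the tail of the longer read is
    unmatched.
    """
    overlap = min(len(a), len(b))
    longer = max(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / overlap, matches / longer


short_read = "ACGTACGT"
long_read = "ACGTACGTTTGA"  # same first 8 bases, 4 extra bases
over_overlap, over_longer = identity_pair(short_read, long_read)
print(over_overlap, over_longer)
```

The same pair is either identical or well below a 97% threshold depending on which denominator is used, which is why reads of unequal length are hard to compare consistently.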

Calculating accurate abundances is important for identifying singletons, which should be discarded. If reads vary in length, then a shorter read might be interpreted as identical to a longer read which has several errors in the additional sequence; this can cause the longer read to have a spurious abundance > 1 and lead to a spurious OTU.
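The effect on abundances can be illustrated with a toy dereplication. In the sketch below, `derep_prefix` is a hypothetical rule that counts a shorter read toward a longer read it is a prefix of, and `derep_exact` counts only full-length identical reads; neither is the USEARCH algorithm, and the sequences are made up.

```python
from collections import Counter


def derep_exact(reads):
    """Dereplicate by exact, full-length identity."""
    return dict(Counter(reads))


def derep_prefix(reads):
    """Dereplicate, counting a shorter read toward a longer read that
    starts with it (hypothetical rule, for illustration only)."""
    counts = Counter(reads)
    abundance = {}
    # Visit longer sequences first so they become the representatives.
    for seq in sorted(counts, key=len, reverse=True):
        for rep in abundance:
            if rep.startswith(seq):
                abundance[rep] += counts[seq]
                break
        else:
            abundance[seq] = counts[seq]
    return abundance


short_read = "ACGTACGT"
long_read = "ACGTACGTTTGA"  # suppose the 4 extra bases contain errors
reads = [short_read, long_read]

# Untrimmed: the long read absorbs the short one and is no longer a singleton.
print(derep_prefix(reads))

# Trimmed to 8 bases: both reads collapse to the same sequence.
print(derep_exact([r[:8] for r in reads]))
```

Without trimming, the error-containing tail is carried by a sequence with abundance 2. After trimming, the abundance of 2 is assigned to the shared 8-base sequence and the tail no longer influences the result.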