Defining unique sequence abundance
Defining abundance when sequence length varies
Calculating unique sequence abundance is problematic when reads of the same
template sequence vary in length, e.g. because reads are truncated when the
quality score drops below a threshold.
Consider two reads A and B where B is
shorter but otherwise identical to A. Here, abundance could be defined in
three different ways.
(1) There are two unique sequences A and B, each with abundance one.
(2) There is one unique sequence A with abundance two.
(3) There is one unique sequence B with abundance two.
All of these definitions have problems.
With (1), a given template sequence with high abundance in the amplicons
will typically have many different unique sequences with low abundances
because its reads are truncated to many different lengths.
With (2) the unmatched tail of A is considered to have the same abundance as
the prefix of A that is identical to B. The tail has no support from other
reads (it is effectively a singleton), but that information is lost and in
practice long reads with noisy tails are assigned high abundances.
With (3), the shortest sequence in a set is supported by longer sequences.
This is the least bad definition: if the abundance is high, the sequence is
likely to be correct. However, phylogenetically and phenotypically
informative bases may be lost, and the ambiguities inherent in comparing
sequences of different length must now be addressed by downstream algorithms
(e.g., denoising or OTU clustering). For example, if two unique sequences
differ in length by one base and have one substitution, should this count as
d=1 (just the substitution) or d=2 (substitution plus terminal gap)? If
large variations in length are allowed, then the phylogenetic and phenotypic
resolution of the sequences may vary substantially, degrading the
comparability of ZOTUs or OTUs to each other for calculating diversity,
predicting taxonomy and so on. These problems are avoided by ensuring that
reads of the same template sequence have the same length (global trimming,
implying that reads of the same template should be globally alignable,
though more distantly related sequences need not be).
Methods for global trimming
The simplest method for
global trimming is to truncate all reads to the same length. This is not
usually necessary with overlapping Illumina paired-end reads that have been
merged by a paired-read assembler. In this case, the merged sequence always
terminates at the reverse primer which guarantees that reads of the same
template will have the same length regardless of variations in amplicon
length between different species. If multiple primers were used which do not
bind to the same locus, then trimming is required to ensure that reads of
the same template amplified by different primers start and end at the same
position in the biological sequence. Primer-binding bases should be
discarded from the reads because PCR tends to induce substitutions at
mismatched positions; in most cases this is easily accomplished by
discarding a fixed number of bases (the primer lengths) from each end of the
sequence. There is no need to explicitly match the primer sequence in order
to trim it unless there are multiple primers binding to different loci.