Abundance estimation

Abundance estimation is the process of determining how common a biological sequence is in a sample. It is most often done using next-generation sequencing reads, which introduces the problem of amplification and sequencing errors.

There are two important sources of bias: PCR amplification bias and gene density bias.

With short reads that only partially cover amplicons, it is important to trim the reads intelligently before making an abundance estimate. See global trimming.

Use the -sizeout option to create size annotations.

Abundance estimation by dereplication
If we assume that sequences are error-free, then the most straightforward way to estimate abundance is to use dereplication. This gives the correct answer assuming (1) there are no PCR errors, (2) there are no PCR biases (some sequences are amplified more than others), and (3) there are no sequencing errors.
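
To make the idea concrete, here is a minimal Python sketch of full-length dereplication: identical reads are collapsed and the number of copies becomes the abundance estimate. The function names, the "Uniq" label prefix and the toy reads are hypothetical, and the ;size=N; label format is assumed to follow the usual size annotation convention. This is an illustration only, not how the usearch commands are implemented.

```python
from collections import Counter

def dereplicate(reads):
    """Collapse identical reads; return (sequence, count) pairs, most abundant first."""
    counts = Counter(seq.upper() for seq in reads)
    # Sort by decreasing count, then by sequence so the order is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def to_fasta(uniques, prefix="Uniq"):
    """Format unique sequences as FASTA with an assumed ;size=N; annotation."""
    lines = []
    for i, (seq, size) in enumerate(uniques, start=1):
        lines.append(f">{prefix}{i};size={size};")
        lines.append(seq)
    return "\n".join(lines)

# Toy input: six reads, three distinct sequences.
reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTGA", "TTGA"]
uniques = dereplicate(reads)
fasta = to_fasta(uniques)
print(fasta)
```

Here ACGA differs from ACGT at one position, so it forms its own singleton rather than adding to the count of ACGT.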

These are questionable assumptions, but even if there are sequencing and/or amplification errors, using dereplication is often a good strategy. There isn't much we can do about PCR bias, because as far as I know, it isn't possible to predict PCR bias from primary sequence.

With errors due to sequencing and amplification (imperfect copying during PCR), some sequences fail to match, creating small spurious clusters (mostly singletons) and causing abundance of the true biological sequence to be underestimated. However, this may not be very harmful in practice for a couple of reasons.

1. The underestimate due to errors applies consistently to all clusters, and it is usually only ratios between abundances that are significant, not absolute values. The ratios between abundances of larger clusters should be reasonably stable against errors, and it is usually these abundances that are the most important.

2. The small, spurious clusters due to errors may be ignored or merged into the correct cluster in a post-processing step. In the case of uchime_denovo, the most important aspect of abundance estimation is that potential parents have higher abundances than their chimeras. Amplicons that are sufficiently abundant to be parents will usually also have enough error-free reads to give a large cluster. In the case of UCLUST, sequences with errors will tend to be merged into their correct cluster.
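
The first point can be checked with a little arithmetic. The sketch below assumes a simplified error model that is not part of the discussion above: errors are independent per base, and every read has the same length and the same per-base error rate. The function name and all numbers are hypothetical. Under those assumptions the expected number of error-free reads shrinks by the same factor for every amplicon, so the ratio between two abundances is unchanged.

```python
def expected_error_free(true_count, per_base_error, read_length):
    """Expected number of reads with no errors, assuming independent per-base errors."""
    return true_count * (1.0 - per_base_error) ** read_length

# Hypothetical example: two amplicons at a true 10:1 ratio.
error_rate = 0.005   # assumed per-base error probability
length = 250         # assumed read length

a = expected_error_free(10000, error_rate, length)
b = expected_error_free(1000, error_rate, length)

fraction_kept = a / 10000   # share of reads expected to be error-free
ratio = a / b               # ratio of the two dereplicated cluster sizes

print(f"error-free fraction: {fraction_kept:.3f}")
print(f"observed ratio: {ratio:.3f}")
```

With these made-up numbers well under half of the reads are expected to be error-free, yet the ratio of the two cluster sizes stays at 10. If error rates differ between amplicons, for example because of length or sequence context, the ratio will shift accordingly.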

Abundance estimation by clustering
An alternative to dereplication is to use clustering at a threshold <100% in order to allow some errors in the input sequences. This raises the question of which identity threshold to use. This depends on a number of factors, including the error characteristics of the sequencing technology and which algorithms will be used for downstream processing. The best way to determine the optimal threshold is to create a benchmark test using simulated reads. In practice, a high threshold such as 99% should usually work well enough.
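
The effect of the threshold can be illustrated with a toy greedy clustering in Python. This is a sketch under strong simplifying assumptions, not the UCLUST algorithm: identity is measured position by position on equal-length sequences rather than by alignment, and each sequence joins the first centroid that meets the threshold. The function names and input data are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        return 0.0  # simplification: no alignment, so unequal lengths never match
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

def greedy_cluster(uniques, threshold):
    """Assign each (sequence, size) to the first centroid at or above the threshold.

    Sequences are processed in order of decreasing size, so abundant
    sequences become centroids before their likely error variants.
    """
    centroids = []  # each entry is [centroid_sequence, total_size]
    for seq, size in sorted(uniques, key=lambda u: -u[1]):
        for cluster in centroids:
            if identity(seq, cluster[0]) >= threshold:
                cluster[1] += size
                break
        else:
            # No centroid was close enough: this sequence starts a new cluster.
            centroids.append([seq, size])
    return [(seq, size) for seq, size in centroids]

# Toy input: two abundant sequences, each with a one-mismatch variant.
uniques = [
    ("ACGTACGTAC", 50),
    ("ACGTACGTAA", 3),
    ("TTTTGGGGCC", 20),
    ("TTTTGGGGCA", 1),
]

strict = greedy_cluster(uniques, threshold=1.0)
relaxed = greedy_cluster(uniques, threshold=0.9)
print(strict)
print(relaxed)
```

At a threshold of 1.0 this behaves like dereplication and keeps four clusters. At 0.9 the one-mismatch variants are absorbed, so the two abundant sequences gain the reads that errors had split off.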

Example using dereplication

usearch -derep_prefix reads.fasta -output uniques.fasta -sizeout

Example using clustering

usearch -cluster_fast reads.fasta -id 0.99 -consout cons.fasta -sizeout
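
Once size annotations are present, abundances can be read back from the FASTA labels. The Python sketch below assumes the labels carry an annotation of the form ;size=N; and uses made-up file content; the function names are hypothetical. It converts the sizes to relative abundances, since it is usually the ratios that matter.

```python
import re

# Assumed annotation format: ";size=N;" somewhere in the FASTA label.
SIZE_RE = re.compile(r";size=(\d+);?")

def read_sizes(fasta_text):
    """Return {label: size} parsed from size annotations in FASTA labels."""
    sizes = {}
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = SIZE_RE.search(line)
            if match:
                # Use the text before the first ';' as the sequence label.
                label = line[1:].split(";")[0]
                sizes[label] = int(match.group(1))
    return sizes

def relative_abundance(sizes):
    """Convert absolute sizes to fractions of the total."""
    total = sum(sizes.values())
    return {label: size / total for label, size in sizes.items()}

# Made-up content standing in for a file such as uniques.fasta.
example = """>Uniq1;size=6;
ACGT
>Uniq2;size=3;
TTGA
>Uniq3;size=1;
ACGA
"""

sizes = read_sizes(example)
fractions = relative_abundance(sizes)
for label, fraction in fractions.items():
    print(label, sizes[label], f"{fraction:.2f}")
```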