USEARCH v11

UNOISE algorithm

See also
 
UNOISE paper
  Should I use UPARSE (97% OTUs) or UNOISE (denoising)?

The UNOISE algorithm performs error-correction (denoising) on amplicon reads. It is implemented in the unoise3 command.

The original UNOISE algorithm was briefly described in Edgar & Flyvbjerg (2015). An improved algorithm, UNOISE2, was described and validated in Edgar (2016). The implementation in unoise3 and uchime3_denovo is quite similar to UNOISE2 except for a change in the chimera detection parameters. I believe this change greatly reduces the number of false positives compared with the original parameters described in the UNOISE2 paper, which were implemented in the earlier unoise2 and uchime2_denovo commands in usearch v9.

The algorithm is designed for Illumina reads; it does not work as well on 454, Ion Torrent or PacBio reads.

Correct biological sequences are recovered from the reads, resolving distinct sequences down to a single difference (often) or two or more differences (almost always).

Errors are corrected as follows:
  - Reads with sequencing and PCR point errors are identified and removed.
  - Chimeras are removed.
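
The first step above can be illustrated with a minimal Python sketch. It is not the unoise3 implementation and makes several assumptions: sequences are compared by Hamming distance (so they must have equal length, whereas a real implementation would align them), the threshold function `max_skew` uses the form 1/2^(alpha*d + 1) with alpha = 2 as I understand it from the UNOISE2 paper, and `min_size` is a hypothetical minimum abundance below which unique sequences are discarded. Chimera removal is not included. All function and parameter names are illustrative.

```python
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def max_skew(d: int, alpha: float = 2.0) -> float:
    """Assumed threshold: the largest abundance ratio (neighbor / parent)
    at which a neighbor with d differences is still called an error."""
    return 1.0 / 2 ** (alpha * d + 1)


def denoise(
    uniques: Dict[str, int], alpha: float = 2.0, min_size: int = 8
) -> Tuple[List[str], Dict[str, str]]:
    """Return (kept sequences, {error sequence: predicted parent})."""
    kept: List[str] = []
    errors: Dict[str, str] = {}
    # Visit unique sequences from most to least abundant (ties broken by sequence).
    for seq, size in sorted(uniques.items(), key=lambda kv: (-kv[1], kv[0])):
        if size < min_size:
            continue  # too rare to classify in this sketch; dropped
        parent = None
        for candidate in kept:
            d = hamming(seq, candidate)
            # Few differences and low relative abundance: predicted bad read.
            if size / uniques[candidate] <= max_skew(d, alpha):
                parent = candidate
                break
        if parent is None:
            kept.append(seq)
        else:
            errors[seq] = parent
    return kept, errors
```

With the hypothetical input `{"ACGTACGT": 1000, "ACGTACCA": 200, "ACGTACGA": 50, "TCGTACGT": 5}`, the second sequence is kept because its abundance is too high for a two-difference error, the third is predicted to be a bad read of the first, and the fourth is dropped for falling below `min_size`.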

Abundances are calculated after denoising by generating an OTU table using the otutab command.
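
To illustrate why abundances are computed as a separate step, here is a minimal Python sketch of building a count table by assigning reads back to the denoised sequences. It is not the otutab implementation: the matching rule (nearest denoised sequence by Hamming distance, within a hypothetical `max_diffs` cutoff) is an assumption of this sketch, as are the names `closest`, `count_table` and the labels `Zotu1`, `Zotu2`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def closest(read: str, denoised: Dict[str, str], max_diffs: int) -> Optional[str]:
    """Return the label of the nearest denoised sequence, or None if every
    candidate differs from the read at more than max_diffs positions."""
    best_name, best_d = None, max_diffs + 1
    for name, seq in denoised.items():
        if len(seq) != len(read):
            continue  # the sketch does not align sequences of different length
        d = sum(x != y for x, y in zip(read, seq))
        if d < best_d:
            best_name, best_d = name, d
    return best_name


def count_table(
    reads: Iterable[Tuple[str, str]], denoised: Dict[str, str], max_diffs: int = 1
) -> Dict[str, Dict[str, int]]:
    """Count reads per denoised sequence and sample.

    reads is an iterable of (sample label, read sequence) pairs.
    """
    table: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for sample, read in reads:
        name = closest(read, denoised, max_diffs)
        if name is not None:
            table[name][sample] += 1
    return {name: dict(counts) for name, counts in table.items()}
```

In this sketch a read with a point error still counts toward the abundance of the sequence it came from, so the table reflects all assignable reads rather than only the error-free ones.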


Schematic of the UNOISE2 denoising strategy (figure from the UNOISE2 paper).
The left panel shows the neighborhood close to a high-abundance unique read sequence X, grouped by the number of sequence differences (d). Dots are unique sequences; the size of a dot indicates its abundance. Green dots are correct biological sequences; red dots have one or more errors. Neighbors with small numbers of differences and small abundance compared to X are predicted to be bad reads of X. The right panel shows the denoised amplicons. Here, X and b were correctly predicted, e is an error with anomalously high abundance that was wrongly predicted to be correct, f is an error that was correctly discarded but has an abundance almost high enough to be a false positive, and g is a low-abundance correct amplicon that was wrongly discarded. The abundances of b, e, and f are similar, illustrating the fundamental challenge in denoising: how to set an abundance threshold that distinguishes correct sequences from errors.