consensus sequences
Home Software Services About Contact     
Follow on twitter

Robert C. Edgar on twitter

11-Aug-2018 New paper describes octave plots for visualizing alpha diversity.

12-Jun-2018 New paper shows that one in five taxonomy annotations in SILVA and Greengenes are wrong.

18-Apr-2018 New paper shows that taxonomy prediction accuracy is <50% for V4 sequences.

05-Oct-2017 PeerJ paper shows low accuracy of closed- and open-ref. QIIME OTUs.

22-Sep-2017 New paper shows 97% threshold is wrong, OTUs should be 99% full-length 16S, 100% for V4.

UPARSE tutorial video posted on YouTube. Make OTUs from MiSeq reads.



Consensus sequences

See also
  consout option

If reads are approximately globally alignable to one biological sequence, then a multiple alignment of a biological sequence to its reads will look something like this. Read errors are highlighted.


The biological sequence can be estimated as the consensus sequence derived from the multiple alignment. In each column of the alignment, the most common letter is taken. If the column contains a gap, the column is discarded. In this example, the biological sequence is recovered correctly. In general, there might be some remaining errors but we expect the consensus sequence to be closer than the longest read or a randomly chosen read from the cluster.

Denoising is better!
For amplicon reads such as 16S and ITS tags, the denoised sequences generated by the unoise3 command will be much better predictions of biological sequences!

OTUs are better
For amplicon reads such as 16S and ITS tags, the centroid sequences generated by cluster_otus will be better predictions of biological sequences. I do not recommend using consensus sequences as OTU representatives.

Limitations of consensus sequences
The multiple alignment constructed by USEARCH is made using method that is designed to be as fast as possible with reasonable accuracy. The alignments, which can be reviewed by using the masout option. may be less accurate than popular multiple alignment programs like MUSCLE, especially at lower sequence identities. In USEARCH, consensus sequences are most appropriate high identities (say, 99%) when the alignments contain few gaps. At lower identities, the accuracy of the multiple alignment will tend to degrade, giving lower quality consensus sequences.