If reads are approximately globally alignable to one biological sequence, then
a multiple alignment of that sequence to its reads will look something like
this.
[Figure: multiple alignment of a biological sequence and its reads, with read
errors marked]
The biological
sequence can be estimated as the consensus sequence derived from the multiple
alignment. In each column of the alignment, the most common letter is taken. If
the column contains a gap, the column is discarded. In this example, the
biological sequence is recovered correctly. In general, there might be some
remaining errors, but we expect the consensus sequence to be closer to the
biological sequence than the longest read or a randomly chosen read from the
cluster.
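
The column rule described above can be sketched in a few lines of Python. This
is an illustration only, not the USEARCH implementation: the function name
consensus, the "-" gap symbol and the drop_any_gap switch are assumptions made
for the example. With drop_any_gap=True the rule is applied exactly as worded
above (any gap discards the column). With drop_any_gap=False a column is
discarded only when the gap is its most common symbol, which is an alternative
reading and is not stated in this text.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol; not taken from the USEARCH documentation


def consensus(rows, drop_any_gap=True):
    """Return a majority-vote consensus of equal-length alignment rows.

    rows          list of aligned sequences (strings of equal length)
    drop_any_gap  True: discard every column that contains a gap
                  False: discard a column only if the gap is most common
    """
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all alignment rows must have the same length")

    out = []
    for column in zip(*rows):  # walk the alignment column by column
        counts = Counter(column)
        if drop_any_gap and GAP in counts:
            continue  # literal rule: a gap anywhere discards the column
        letter, _ = counts.most_common(1)[0]  # most common symbol
        if letter == GAP:
            continue  # majority is a gap, so the column is discarded
        out.append(letter)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical reads: one substitution error and one insertion error.
    reads = ["ACGT-A", "ACGT-A", "ACCTGA", "ACGT-A"]
    print(consensus(reads))  # ACGTA
```

Ties between equally common letters are broken by whichever letter appears
first in the column, which is arbitrary; a real tool may resolve ties
differently.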
OTUs are better
For amplicon reads such as 16S and ITS tags, the centroid sequences generated
by cluster_otus will be better predictions of biological sequences than
consensus sequences. I do not recommend using consensus sequences as OTU
representatives.
Limitations of consensus sequences
The multiple alignment constructed by USEARCH is made using a method that is
designed to be as fast as possible with reasonable accuracy. The alignments,
which can be reviewed by using the msaout option, may be less accurate than
those produced by popular multiple alignment programs like MUSCLE, especially
at lower sequence identities. In USEARCH, consensus sequences are most
appropriate at high identities
(say, 99%) when the alignments contain few gaps. At lower identities, the
accuracy of the multiple alignment will tend to degrade, giving lower quality