CD-HIT misalignment due to banding

 
Below are two alignments of a pair of 16S reads (given in FASTA format at the bottom of this page). The top alignment is by CD-HIT and the lower alignment is by USEARCH. There is a highly conserved region at the end of the reads that is grossly misaligned by CD-HIT (red letters, highlighted in yellow below). The cause of the misalignment is the banded dynamic programming algorithm used by CD-HIT: if the length difference within a segment between two conserved regions exceeds the band width, one or both of the conserved regions will be misaligned. If you're interested in reproducing these results, see here for instructions on how to view CD-HIT alignments.
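The effect can be reproduced on a toy example. The sketch below is not CD-HIT's code: it is a minimal banded local alignment score (Smith-Waterman with a linear gap penalty) restricted to cells within a fixed distance of the main diagonal. The function name, the scoring parameters and the two short sequences are assumptions chosen for illustration. Both sequences share two conserved blocks, but one has an 8-base insertion between them, so the second block lies on a diagonal that a narrow band never visits.

```python
def banded_local_score(a, b, band, match=2, mismatch=-3, gap=-1):
    """Best local alignment score using only cells with |i - j| <= band.

    Illustrative only: linear gap penalty, band centred on the main diagonal.
    """
    neg = float("-inf")  # marks cells outside the band
    # Row 0: a local alignment may start at any in-band cell with score 0.
    prev = [0 if j <= band else neg for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [neg] * (len(b) + 1)
        lo = max(0, i - band)
        hi = min(len(b), i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = 0
                continue
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


# Two conserved blocks; seq_a has 8 extra bases between them.
block1, block2 = "ACGTACGTAC", "GGCATGCCAG"
seq_a = block1 + "TTTTTTTT" + block2
seq_b = block1 + block2

full = banded_local_score(seq_a, seq_b, band=max(len(seq_a), len(seq_b)))
narrow = banded_local_score(seq_a, seq_b, band=3)
wide = banded_local_score(seq_a, seq_b, band=8)
print(full, narrow, wide)
```

With the band narrower than the 8-base length difference, only the first block can be aligned correctly and the score drops. Once the band is at least as wide as the length difference, the banded result matches the unbanded one.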

CD-HIT banding causes more errors than USEARCH banding
USEARCH also uses banding by default, but the strategy is quite different, and the --band option of USEARCH is not equivalent to the -b option of CD-HIT. In CD-HIT, the band spans the entire alignment. In USEARCH, banding is used only for the regions between HSPs (high-scoring segment pairs), and the band width is set dynamically according to the length difference of the aligned regions. The --band option of USEARCH sets the minimum band width, while the CD-HIT -b option sets the maximum band width. USEARCH alignments are therefore much less prone to banding artifacts.
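The difference between a minimum and a maximum band width can be sketched as follows. This is not USEARCH's implementation: the function name, the HSP tuple layout and the rule "width = the larger of the minimum and the length difference" are assumptions made to illustrate the description above.

```python
def region_band_widths(hsps, min_band):
    """Band width for each region between consecutive HSPs.

    hsps: list of (a_start, a_end, b_start, b_end) with half-open coordinates,
          sorted and non-overlapping in both sequences (hypothetical layout).
    min_band: lower bound on the band width, playing the role of a minimum.
    """
    widths = []
    for (_, a_end, _, b_end), (a_start, _, b_start, _) in zip(hsps, hsps[1:]):
        len_a = a_start - a_end  # unaligned bases between the HSPs in sequence A
        len_b = b_start - b_end  # unaligned bases between the HSPs in sequence B
        # Assumed rule: never narrower than the length difference of the region.
        widths.append(max(min_band, abs(len_a - len_b)))
    return widths


# Two HSPs separated by 8 bases in sequence A and 0 bases in sequence B.
example_hsps = [(0, 10, 0, 10), (18, 28, 10, 20)]
print(region_band_widths(example_hsps, min_band=3))
print(region_band_widths(example_hsps, min_band=16))
```

Under this assumed rule the band always covers the length difference of the region, so a small minimum cannot hide a conserved block, whereas a fixed maximum applied to the whole alignment can.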

USEARCH options for assessing and tuning banding
USEARCH alignment heuristics, including banding, can be disabled with --nofastalign, so alignments can be compared with and without heuristics. This lets USEARCH users check for artifacts and tune the speed heuristics to trade alignment quality for speed.

>M31Mout_50786
CTGGGCCGTGTCTCAGTCCCAGTGTGGCTGGTCATCCTCTCAGACCAGCTAGAGATCGTCGGCTTGGTGAGCCTTTACCT
CACCAACTACCTAATCCCACTTGGGCTCATCCTATGGCATGTGGCCCGAAGGTCCCACACTTTCATCTTCCGTACGTAAC
TTACCGTACCGGGTACGGTTAAGTTACGTACCTAACGTTTACCCGGTTTACCCGGTTTAACGTTTACCCCCTTCCCCCCT
ACCCTAAAGTAACTACGTAAGTTACCCTTAACCCGAACGACTTAA
>M22Fcsp_243936
CTGGGCCGTGTCTCAGTCCCAGTGTGGCTGGTCATCCTCTCAGACCAGCTAGAGATCGTCGGCTTGGTGAGCCTTTACCT
CACCAACTACCTAATCCCACTTGGGCTCATCCTATGGCATGTGGCCCGAAGGTCCCACACTTTCATCTCTCGATTCTACG
CGGTATTAACTACTACTTACCGTTTACCGGTTACGTTTACCCCTTCCCCCTACCTAATAACGTACGTAAGTTACCTTAAC
CCGAACGACTTAA