Uclust release notes v1.1

Cluster, align and search millions of sequences in minutes.

Release notes for v1.1.

WARNING
Please note that version 1.1 is not backwards compatible with v1.0. I apologize for any inconvenience -- I understand the importance of backwards compatibility, and I usually strive to maintain it. However, I believe the new design is significantly better, and I chose to bite the bullet now while there are relatively few users. The major changes are as follows.

The .uc file format has changed
The tab-separated .uc file format now has one more field that gives the label of the target sequence (i.e., the database or seed sequence that matched the query). In v1.0, the last field was the query label. In v1.1, the last two fields are now the query label and target label. Since tabs in the labels would cause problems parsing the file, tabs in labels are now replaced by spaces when FASTA files are read.
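
If you have scripts that read .uc files, the change can be handled roughly as follows. This is a minimal Python sketch, not part of the program: the function names sanitize_label and split_labels are hypothetical, and the code relies only on what is stated above (the last two tab-separated fields are the query label and the target label). It makes no assumption about the number or meaning of the other fields.

```python
from typing import Tuple


def sanitize_label(label: str) -> str:
    """Replace tabs with spaces so a label cannot break a tab-separated record."""
    return label.replace("\t", " ")


def split_labels(uc_line: str) -> Tuple[str, str]:
    """Return (query_label, target_label) from one v1.1-style .uc record.

    Only the last two fields are interpreted; all earlier fields are ignored.
    """
    fields = uc_line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least two tab-separated fields")
    return fields[-2], fields[-1]


# Illustrative record: the leading fields are placeholders, not real .uc values.
record = "\t".join(["x", "y", sanitize_label("query\t1"), "seed A"])
query_label, target_label = split_labels(record)
```

A v1.0 parser that takes the last field as the query label will pick up the target label instead when given a v1.1 file, so this is the place to check first when updating old scripts.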

Command-line options have changed
Some options have been removed (e.g., --termgaps, --gapdiffs and --termgapdiffs), and some have been given better names; for example, --exactalign is now --nofastalign and --mem is now --split.

Different definition of identity
In v1.0, the default definition of identity was (number of identical letter-letter columns)/(number of letter-letter columns). This sometimes gave undesirable results due to short regions of high identity in otherwise gappy alignments. In v1.1, identity is defined as (number of identical letter-letter columns)/(length of shorter sequence). I believe this will meet the needs of all current users without the need for --gapdiffs or --termgapdiffs options. If this causes problems for you, please let me know and I'll add appropriate options for your needs.
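
To see how the two definitions differ, here is a small Python sketch that computes both from a pairwise alignment given as two gapped rows of equal length. The function name identities and the use of "-" as the gap character are assumptions made for this illustration; this is not the program's own code.

```python
from typing import Tuple


def identities(query_row: str, target_row: str, gap: str = "-") -> Tuple[float, float]:
    """Return (v1.0-style identity, v1.1-style identity) for one alignment."""
    if len(query_row) != len(target_row):
        raise ValueError("aligned rows must have equal length")

    letter_pairs = 0  # columns where both rows have a letter
    identical = 0     # letter-letter columns where the letters agree
    for q, t in zip(query_row, target_row):
        if q != gap and t != gap:
            letter_pairs += 1
            if q.upper() == t.upper():
                identical += 1

    # Ungapped sequence lengths; the shorter one is the v1.1 denominator.
    query_len = sum(1 for c in query_row if c != gap)
    target_len = sum(1 for c in target_row if c != gap)
    shorter = min(query_len, target_len)

    v10 = identical / letter_pairs if letter_pairs else 0.0
    v11 = identical / shorter if shorter else 0.0
    return v10, v11


# A short region of perfect identity in an otherwise gappy alignment.
old_id, new_id = identities("ACGT----TTTT", "ACGTGGGG----")
```

In this example both sequences have 8 letters but only 4 letter-letter columns, all identical, so the v1.0 definition gives 4/4 = 1.0 while the v1.1 definition gives 4/8 = 0.5.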

Improved control over alignment
Version 1.1 provides a rich model of gap penalties which allows fine control over the style of alignment that best models your data. You can specify different open and extend penalties for internal and terminal gaps, for left- and right-end gaps, and different penalties for query vs. target sequence. This is explained in detail in the manual.

Improved alignment quality
In some applications, the heuristics used to achieve very fast alignment speeds (HSPs and banding) could give poor-quality alignments by finding spurious HSPs and/or by missing regions of high similarity due to long indels. Version 1.1 improves alignment quality by increasing the stringency of HSP identification and increasing the band width. This may sometimes result in slower execution times; you can adjust parameters to achieve higher speeds if needed. A new feature automatically compares all alignments made by fast heuristics to an "exact" Needleman-Wunsch alignment. This allows you to evaluate the speed/sensitivity trade-off on typical data for your applications.
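
For readers unfamiliar with what "exact" means here, the following Python sketch computes the optimal global (Needleman-Wunsch) alignment score by filling the full dynamic-programming matrix, with no HSPs and no banding. It is a simplified illustration only: the scoring values (match +1, mismatch -1) and the single linear gap penalty of -2 are assumptions chosen for brevity, whereas the program itself uses separate open and extend penalties as described above. The function name nw_score is hypothetical.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b under a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


exact = nw_score("ACGT", "AGT")
```

Because every cell of the matrix is considered, the result cannot miss a region of similarity the way a banded or HSP-guided alignment can. The cost is time proportional to the product of the two sequence lengths, which is why comparing against it is useful mainly for evaluating the heuristics on sample data.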