calc_distmx command

Generate a distance matrix from an input file in FASTA or FASTQ format.

The distance matrix filename is specified by the -tabbedout option.

Distance values range from zero (identical sequences) to one (no similarity), corresponding to 100% identity down to 0% identity.
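The mapping between identity and distance can be sketched as below. This is an illustration only: it assumes the linear relationship implied above (distance = 1 - identity/100), and the function names are hypothetical helpers, not part of USEARCH.

```python
def identity_to_distance(percent_identity: float) -> float:
    """Convert a percent identity (0 to 100) to a distance (0.0 to 1.0).

    Assumes the linear mapping distance = 1 - identity/100.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must be between 0 and 100")
    return 1.0 - percent_identity / 100.0


def distance_to_identity(distance: float) -> float:
    """Convert a distance (0.0 to 1.0) back to a percent identity."""
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must be between 0.0 and 1.0")
    return (1.0 - distance) * 100.0
```

Under this assumption, a pair at 80% identity has a distance of about 0.2, which is the value used with -maxdist in the examples.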

Multithreading is supported.

Clusters can be generated from a distance matrix with the cluster_aggd command.

By default, pairs are prioritized by the U-sort heuristic as used in the USEARCH algorithm: pairs are considered in decreasing order of the number of unique words (U) they have in common. Since U correlates with identity, pairs are therefore considered in approximately increasing order of distance.

U-sorting can be turned off using the -nousort option. U-sorting plus additional heuristics used to find HSPs can all be disabled using the -distmx_brute option, which forces all pairs of sequences to be aligned. This is guaranteed to give a complete matrix, but can be much slower for large datasets. Note that low-identity pairs generally have little effect on clustering or tree topology, so the additional "accuracy" of a brute force calculation often has little biological value.
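The idea behind U-sorting can be sketched in a few lines of Python. This is a toy model, not the USEARCH implementation: it counts shared words for every pair by brute force, and the function names and the default word length k=8 are assumptions chosen for illustration.

```python
from itertools import combinations


def words(seq: str, k: int) -> set:
    """Return the set of distinct length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def u_sorted_pairs(seqs: dict, k: int = 8) -> list:
    """Return (U, label_a, label_b) tuples in decreasing order of U.

    U is the number of distinct words the two sequences share.
    Pairs with no word in common are dropped.
    """
    word_sets = {label: words(seq, k) for label, seq in seqs.items()}
    pairs = []
    for a, b in combinations(sorted(seqs), 2):
        u = len(word_sets[a] & word_sets[b])
        if u > 0:
            pairs.append((u, a, b))
    # Stable sort: ties keep their label order.
    pairs.sort(key=lambda t: -t[0])
    return pairs
```

Pairs at the front of the returned list share the most words and are therefore likely, though not guaranteed, to have the smallest distances.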

The -maxdist option gives the maximum distance which should be written when the output is in tabbed_pairs format. This can greatly reduce the size of an output file.
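The effect of -maxdist on a list of pairs can be illustrated with a small filter. The sketch assumes each record is a line with three tab-separated fields (label1, label2, distance) and that a pair exactly at the threshold is kept; both points are assumptions to check against real output, and filter_pairs is a hypothetical helper.

```python
def filter_pairs(lines, maxdist):
    """Yield (label1, label2, distance) for records with distance <= maxdist.

    Assumes three tab-separated fields per line; blank lines are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        label1, label2, dist_text = line.split("\t")
        dist = float(dist_text)
        if dist <= maxdist:
            yield label1, label2, dist
```

Any iterable of lines works as input, including an open file handle.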

A distance threshold for terminating the calculation can be specified using the -termdist option, which is in the range 0.0 to 1.0, where 0.0 means identical sequences (100% sequence id). This is a speed optimization that saves time by skipping alignments of low-identity pairs. If a pair is encountered with distance > termdist, the calculation is stopped. Because U-sorted order does not correlate perfectly with identity, you should set termdist somewhat higher than the maximum distance that you care about. For example, if you want all pairs with >80% id to appear in the matrix, then you might set -maxdist 0.2 -termdist 0.3. Tests on small datasets can be used to tune -termdist to a reasonable value. By default, termdist is set to 1.0 and the calculation continues for all pairs that have at least one word in common. The word length is set by the -wordlength option.
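The interaction between -maxdist and -termdist can be modelled with a short simulation. This is a simplified sketch under stated assumptions: pairs arrive as one stream in U-sorted order, processing stops at the first pair whose distance exceeds termdist, and only pairs with distance <= maxdist are written. The real program may apply the stopping rule differently (for example, per query sequence), and collect_pairs is a hypothetical name.

```python
def collect_pairs(ordered_pairs, maxdist, termdist):
    """Return the pairs that would be written under the simplified model.

    ordered_pairs: iterable of (label_a, label_b, distance) in the order
    the pairs are considered (approximately increasing distance).
    """
    kept = []
    for a, b, dist in ordered_pairs:
        if dist > termdist:
            break  # stop: remaining pairs are never aligned
        if dist <= maxdist:
            kept.append((a, b, dist))
    return kept


# Distances are only approximately increasing, as with U-sorting.
stream = [
    ("s1", "s2", 0.05),
    ("s1", "s3", 0.25),
    ("s2", "s3", 0.15),
    ("s1", "s4", 0.40),
    ("s2", "s4", 0.18),
]
```

With this stream, setting termdist equal to maxdist (0.2) stops at the second pair and writes only one pair, while termdist 0.3 also recovers the 0.15 pair. The 0.18 pair is still missed because it follows a pair at 0.40, which is why the margin is worth tuning on a small dataset.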

Examples

usearch -calc_distmx seqs.fa -tabbedout mx.txt -maxdist 0.2 -termdist 0.3

usearch -calc_distmx seqs.fa -tabbedout dist.tree -format phylip_lower_triangular