The calc_distmx command generates a distance matrix from
an input file in FASTA or FASTQ format. See also
The distance matrix filename is specified by the -distmxout option.
The matrix format is specified by the -format option,
which can be tabbed_pairs (default), square, phylip_square or
phylip_lower_triangle. See distance matrix for details
of these file formats.
Distance values range from zero (identical sequences) to one
(no similarity), corresponding to the range 100% identity to 0% identity.
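The relationship between identity and distance can be sketched in a few lines of Python. This assumes the mapping is the simple linear one implied by the endpoints above (distance = 1 - fractional identity); the function name is hypothetical and is not part of USEARCH.

```python
def identity_to_distance(identity: float) -> float:
    """Convert fractional identity (0.0 to 1.0) to a distance (0.0 to 1.0).

    Assumption: distance is 1 - identity, which matches the stated
    endpoints (100% identity -> 0, 0% identity -> 1).
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be in the range 0.0 to 1.0")
    return 1.0 - identity
```

Under this assumption, a pair with 80% identity would have a distance of about 0.2.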
Clusters can be generated from a distance matrix with the
By default, pairs are prioritized by the U-sort heuristic
as used in the USEARCH algorithm. This means
that pairs are considered in decreasing order of the number of unique words (U)
they have in common. Since U correlates with identity,
this means that pairs are considered in approximately increasing order of
distance. U-sorting can be turned off using the -nousort option. U-sorting plus
additional heuristics used to find HSPs can all be disabled using the -distmx_brute
option, which forces all pairs of sequences to be aligned. This is guaranteed to
give a complete matrix, but can be much slower for large datasets. Note that
low-identity pairs generally have little effect on clustering or tree topology,
so the additional "accuracy" of a brute force calculation often has little practical benefit.
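The idea behind U-sorting can be illustrated with a small Python sketch. It is not the USEARCH implementation: it takes "unique words" to mean the distinct k-mers of each sequence, counts the words shared by each pair (U), and orders pairs by decreasing U. The function names and the word length of 3 are assumptions chosen for illustration.

```python
from __future__ import annotations

from itertools import combinations


def unique_words(seq: str, k: int) -> set[str]:
    """Return the set of distinct words (k-mers) of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def usort_pairs(seqs: dict[str, str], k: int = 3) -> list[tuple[str, str, int]]:
    """List sequence pairs in decreasing order of shared unique words (U).

    Pairs with no word in common are dropped, mirroring the statement
    that only pairs sharing at least one word are considered.
    """
    words = {label: unique_words(s, k) for label, s in seqs.items()}
    pairs = []
    for a, b in combinations(seqs, 2):
        u = len(words[a] & words[b])  # U: number of shared unique words
        if u > 0:
            pairs.append((a, b, u))
    # Highest U first, so likely high-identity pairs come first.
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs
```

Because U only correlates with identity, the resulting order approximates, but does not guarantee, increasing distance.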
The -sparsemx_minid option sets the minimum identity a pair must have
to be written to a matrix in tabbed_pairs format.
An identity threshold for terminating the calculation can
be specified using the -termid option, which is in
the range 0.0 to 1.0, where 1.0 means identical sequences (100% sequence identity).
This is a speed optimization that saves time by skipping alignments of
low-identity pairs. If a pair is encountered with fractional identity < termid,
the calculation is stopped. Because U-sorted order does not correlate perfectly
with identity, you should set termid somewhat lower than the minimum identity
that you care about. For example, if you want all pairs with >80% id to appear
in the matrix, then you might set -termid 0.7. Tests on small datasets can be
used to tune -termid to a reasonable value. By default, termid is set to 0 and
the calculation continues for all pairs that have at least one word in common.
The word length is set by the -wordlength option.
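The interaction between the termination threshold and the minimum reported identity can be sketched as follows. This is a simplified model, not USEARCH code: it assumes the pairs arrive already in U-sorted order with their identities known, that distance is 1 - identity, and that a pair exactly at the minimum identity is written. The function name and the sample data are hypothetical.

```python
def filter_pairs(sorted_pairs, termid=0.0, minid=0.0):
    """Model -termid and -sparsemx_minid on pairs given in U-sorted order.

    sorted_pairs: iterable of (label1, label2, fractional_identity).
    Returns a list of (label1, label2, distance) that would be written.
    """
    kept = []
    for a, b, identity in sorted_pairs:
        if identity < termid:
            # First pair below termid: stop the whole calculation.
            break
        if identity >= minid:
            # Only pairs at or above the minimum identity are written.
            kept.append((a, b, 1.0 - identity))
    return kept


# Hypothetical pairs in U-sorted order; identity is not strictly decreasing.
pairs = [
    ("s1", "s2", 0.95),
    ("s1", "s3", 0.78),
    ("s2", "s3", 0.85),
    ("s3", "s4", 0.65),
    ("s1", "s4", 0.90),
]
written = filter_pairs(pairs, termid=0.7, minid=0.8)
```

In this made-up data the last pair has 90% identity but is never reached, because an earlier pair fell below termid. Setting termid lower than the identity you care about reduces the chance of this, though in this model it cannot rule it out.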
usearch -calc_distmx seqs.fa -distmxout mx.txt -sparsemx_minid 0.8 -termid 0.7
usearch -calc_distmx seqs.fa -distmxout dist.tree -format