Distance matrix file

See also
  calc_distmx command
  cluster_aggd command

A distance matrix file contains pair-wise distances between a set of sequences.

A distance is specified either as a fractional identity (the default) or as a fractional difference, which is 1.0 - (fractional identity). Fractional identity is more familiar to most people and more consistent with other USEARCH commands, but fractional difference behaves as a distance measure because it increases as sequences diverge (it is zero for identical sequences and one for maximally different sequences). To specify fractional difference, use the -distmo option (for an output file) or the -distmi option (for an input file).
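The relationship between the two measures is a single subtraction. As a minimal sketch in Python (the function names are hypothetical and not part of USEARCH), the conversion in each direction could look like this:

```python
def identity_to_difference(fract_id: float) -> float:
    """Convert a fractional identity in [0, 1] to a fractional difference."""
    if not 0.0 <= fract_id <= 1.0:
        raise ValueError(f"fractional identity out of range: {fract_id}")
    return 1.0 - fract_id


def difference_to_identity(fract_diff: float) -> float:
    """Convert a fractional difference in [0, 1] to a fractional identity."""
    if not 0.0 <= fract_diff <= 1.0:
        raise ValueError(f"fractional difference out of range: {fract_diff}")
    return 1.0 - fract_diff


# Identical sequences: identity 1.0 corresponds to difference 0.0.
print(identity_to_difference(1.0))
```

The range checks are an assumption added for safety; they simply reject values outside the interval from zero to one.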

It is straightforward to add support for other measures of evolutionary distance; if this would be useful for your applications, let me know.

USEARCH distance matrices are usually "sparse", meaning that only a subset of the pair-wise distances is calculated. Pairs with low identities (determined by a threshold and/or by word-counting heuristics) are omitted from the matrix, which can dramatically reduce the time and space required to compute and store a matrix for large sequence sets.

Four formats are supported for distance matrices: tabbed_pairs, square, phylip_square and phylip_lower_triangular. In formats that require all distances to be included (all except tabbed_pairs), unknown values are given as 0.0 (identity) or 1.0 (difference).
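To illustrate how unknown values are filled in when a sparse matrix is expanded to a full one, here is a minimal Python sketch. The function name to_dense and its arguments are hypothetical, and the sketch assumes the matrix is symmetric, which this page does not state explicitly.

```python
from typing import Dict, List, Tuple


def to_dense(
    labels: List[str],
    pairs: Dict[Tuple[str, str], float],
    is_difference: bool,
) -> List[List[float]]:
    """Expand sparse pair values into a full square matrix."""
    # Unknown pairs: 1.0 for fractional difference, 0.0 for fractional identity.
    unknown = 1.0 if is_difference else 0.0
    # A sequence compared with itself: difference 0.0, identity 1.0.
    diagonal = 0.0 if is_difference else 1.0

    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    matrix = [[unknown] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = diagonal
    for (label1, label2), value in pairs.items():
        i, j = index[label1], index[label2]
        matrix[i][j] = value
        matrix[j][i] = value  # assumption: the matrix is symmetric
    return matrix


dense = to_dense(["A", "B", "C"], {("A", "B"): 0.03}, is_difference=True)
for row in dense:
    print(row)
```

In this example only the A/B pair is known, so every entry involving C other than its own diagonal is filled with 1.0.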

tabbed_pairs format
The matrix is stored as a tabbed text file. There are three fields in each line: Label1, Label2 and Distance. Pairs whose distances are unknown, or whose identities fall below the threshold, are omitted to save disk space. To simplify sequential parsing of this format, the file always starts with the diagonal, i.e. one line for each sequence giving its distance from itself, e.g. Label1 Label1 1.0 (identity) or Label1 Label1 0.0 (difference). This ensures that the labels of all sequences are known before any off-diagonal entries are processed.
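The layout described above lends itself to a single sequential pass. The following Python sketch is one way to read it; read_tabbed_pairs is a hypothetical helper, not a USEARCH function, and it treats any line where both labels are equal as a diagonal entry. Skipping blank lines is an assumption, not a documented feature of the format.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_tabbed_pairs(
    handle: TextIO,
) -> Tuple[List[str], Dict[Tuple[str, str], float]]:
    """Read a tabbed_pairs file.

    Returns (labels, pairs): labels in the order their diagonal lines
    appear, and {(label1, label2): value} for off-diagonal entries only.
    """
    labels: List[str] = []
    pairs: Dict[Tuple[str, str], float] = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-separated fields")
        label1, label2, value = fields[0], fields[1], float(fields[2])
        if label1 == label2:
            labels.append(label1)  # diagonal entry registers the label
        else:
            pairs[(label1, label2)] = value
    return labels, pairs


# Made-up example using fractional difference (diagonal is 0.0).
example = io.StringIO("A\tA\t0.0\nB\tB\t0.0\nC\tC\t0.0\nA\tB\t0.03\n")
labels, pairs = read_tabbed_pairs(example)
print(labels)  # ['A', 'B', 'C']
print(pairs)   # {('A', 'B'): 0.03}
```

Because the diagonal comes first, the full label list is available before the first off-diagonal value is stored, so a caller could allocate any per-sequence structures at that point.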

square format
The matrix is stored as a tabbed text file. There is one header line, followed by one line per sequence. The header line starts with an integer giving the number of sequences, followed by a tab-separated list of sequence labels. The following lines start with a sequence label, followed by a tab-separated list of distances. This format may become large, and hard for humans to read, when there are many sequences.
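A reader for this layout can be sketched in a few lines of Python. The helper read_square below is hypothetical. It assumes every label in the header also appears as a row label, and it returns the rows in header order so that it does not depend on the rows being written in that order.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_square(handle: TextIO) -> Tuple[List[str], List[List[float]]]:
    """Read a square-format matrix; returns (labels, matrix)."""
    header = handle.readline().rstrip("\r\n").split("\t")
    count = int(header[0])   # first header field: number of sequences
    labels = header[1:]      # remaining header fields: sequence labels
    if len(labels) != count:
        raise ValueError(f"header gives {count} sequences but lists {len(labels)} labels")

    rows: Dict[str, List[float]] = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        values = [float(v) for v in fields[1:]]
        if len(values) != count:
            raise ValueError(f"row {fields[0]!r}: expected {count} values")
        rows[fields[0]] = values

    # Order rows to match the header (assumes each header label has a row).
    matrix = [rows[label] for label in labels]
    return labels, matrix


# Made-up example using fractional difference.
example = io.StringIO("2\tA\tB\nA\t0.0\t0.03\nB\t0.03\t0.0\n")
labels, matrix = read_square(example)
print(labels)  # ['A', 'B']
print(matrix)  # [[0.0, 0.03], [0.03, 0.0]]
```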

phylip_square and phylip_lower_triangular formats
These formats are compatible with Phylip; see the Phylip documentation for format details. If you are going to use Phylip to process a distance matrix generated by USEARCH, use -distmo fractdiff (fractional difference), because Phylip assumes a distance measure where zero means identical sequences and increasing values indicate decreasing similarity.