Distance matrix file
A distance matrix file contains pair-wise distances
between a set of sequences, samples or other pair-wise comparable objects.
Distances between sequences are specified as 1 – fractional identity, so they
range from 0.0 for identical sequences to 1.0 for sequences with 0% identity.
Distances between samples are the values of a
beta diversity metric.
With sequences, distance matrices are often "sparse", meaning
that only a subset of the pairs is calculated. Pairs with low identities (determined by a
threshold and/or by word-counting heuristics) are omitted from the matrix, which
can dramatically reduce the time and space required to compute and store a
matrix for large sequence sets. Missing entries are assumed to be 1.0, i.e. the
maximum possible distance (equivalently, the lowest possible identity). This
often overstates the distance, but in most situations pairs with low
identities are effectively ignored, so it does not matter whether the distance
is, say, 0.5 or 1.0: the result of a given analysis will be the same. Using
sparse matrices is therefore a useful optimization to reduce file sizes and
computation time.
Four formats are supported for distance matrices:
tabbed_pairs, square, phylip_square and phylip_lower_triangular. In formats that
require all distances to be included (all except tabbed_pairs), unknown values
are given as 0.0 (identity) or 1.0 (distance).
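As an illustration of how unknown values end up in a dense format, the following minimal Python sketch expands sparse pairs into a full square matrix of distances. The function name `to_dense` and the dictionary layout are assumptions made for this example and are not part of USEARCH; the fill value of 1.0 follows the distance convention described above.

```python
def to_dense(labels, pairs, unknown=1.0):
    """Expand sparse {(label_a, label_b): distance} pairs into a square matrix.

    Hypothetical helper for illustration only. Pairs that are absent from
    `pairs` are filled with `unknown` (1.0, the maximum distance).
    """
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    # Diagonal is 0.0 (a sequence compared with itself); everything else
    # starts out as "unknown".
    matrix = [[0.0 if i == j else unknown for j in range(n)] for i in range(n)]
    for (a, b), dist in pairs.items():
        i, j = index[a], index[b]
        matrix[i][j] = dist
        matrix[j][i] = dist  # distances are symmetric
    return matrix


dense = to_dense(["A", "B", "C"], {("A", "B"): 0.03})
```

Here only the A/B pair was calculated, so every entry involving C other than its diagonal is written as 1.0.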
tabbed_pairs format
The matrix is stored as a tabbed text file. There are three fields in each
line: Label1, Label2 and Distance. Pairs whose distances are unknown or whose
identities fall below the threshold are omitted to save disk space. To simplify sequential
parsing of this format, the file always starts with the diagonal, i.e. one line
for each sequence giving a distance of zero from itself, e.g.
Label1 Label1 1.0 (identity) or Label1 Label1 0.0 (difference).
This ensures that the labels of
all sequences are known before any off-diagonal entries are processed, which is
convenient for code which reads a matrix in this format.
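The sequential parsing described above can be sketched in Python as follows. This is a minimal example, not USEARCH code: the names `read_tabbed_pairs` and `get_distance` are hypothetical, the fields are assumed to be separated by tab characters, and the third field is assumed to hold a distance (fractional difference) rather than an identity.

```python
import io


def read_tabbed_pairs(handle):
    """Read Label1<TAB>Label2<TAB>Distance lines (assumed layout).

    Returns (labels, distances). Labels are collected from the diagonal
    lines at the start of the file; distances are keyed by unordered pair.
    """
    labels = []
    distances = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label1, label2, value = line.split("\t")
        if label1 == label2:
            # Diagonal entry: records that this label exists.
            if label1 not in labels:
                labels.append(label1)
            continue
        distances[frozenset((label1, label2))] = float(value)
    return labels, distances


def get_distance(distances, a, b, missing=1.0):
    """Look up a pair; omitted pairs are assumed to be at distance 1.0."""
    if a == b:
        return 0.0
    return distances.get(frozenset((a, b)), missing)


example = io.StringIO("A\tA\t0.0\nB\tB\t0.0\nC\tC\t0.0\nA\tB\t0.03\n")
labels, distances = read_tabbed_pairs(example)
```

Because the diagonal comes first, `labels` is complete before any off-diagonal line is read, and a pair that was omitted from the file, such as A and C here, is looked up as 1.0.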
square format
The matrix is stored as a tabbed text file. There is one header line,
followed by one line per sequence. The header line starts with an integer giving
the number of sequences, followed by a tab-separated list of sequence labels.
The following lines start with a sequence label, followed by a tab-separated
list of distances. With many sequences, this format can become large and hard
for humans to read.
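A reader for this layout might look like the following Python sketch. It assumes exactly the structure described above (a header with the count followed by the labels, then one tab-separated row per sequence); the function name `read_square` is hypothetical and the validation is only illustrative.

```python
import io


def read_square(handle):
    """Read a square matrix: header 'N<TAB>label...' then one row per label.

    Returns (labels, matrix) where matrix[row_label][column_label] is a float.
    """
    header = next(handle).rstrip("\n").split("\t")
    n = int(header[0])
    labels = header[1:]
    if len(labels) != n:
        raise ValueError("header count does not match number of labels")

    matrix = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue  # skip blank lines
        row_label, values = fields[0], fields[1:]
        if len(values) != n:
            raise ValueError(f"row {row_label!r} has {len(values)} values, expected {n}")
        matrix[row_label] = dict(zip(labels, (float(v) for v in values)))
    return labels, matrix


example = io.StringIO("2\tA\tB\nA\t0.0\t0.25\nB\t0.25\t0.0\n")
square_labels, square = read_square(example)
```

Storing each row as a dictionary keyed by column label keeps lookups readable, at the cost of more memory than a plain list of lists.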
phylip_square and phylip_lower_triangular formats
These formats are compatible with Phylip; see the
Phylip documentation for format details. You should use -distmo
fractdiff (fractional difference) if you are going to use Phylip to process a
distance matrix generated by USEARCH because Phylip assumes a distance metric
where zero means identical sequences and increasing values indicate decreasing