Distance matrix file

See also
  calc_distmx command
  cluster_aggd command

A distance matrix file contains pair-wise distances between a set of sequences, samples, OTUs or other pair-wise comparable objects.

Distances between sequences are specified as 1 - fractional identity, so they range from 0.0 for identical sequences to 1.0 for sequences with 0% identity.
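The conversion between identity and distance can be sketched in a few lines of Python. This is a minimal illustration, not code from the software itself; the function name `identity_to_distance` is hypothetical, and the identity is assumed to be given as a fraction between 0.0 and 1.0 rather than as a percentage.

```python
def identity_to_distance(fractional_identity: float) -> float:
    """Convert a fractional identity (0.0 to 1.0) to a distance.

    Hypothetical helper for illustration: distance = 1 - fractional identity.
    """
    if not 0.0 <= fractional_identity <= 1.0:
        raise ValueError("fractional identity must be between 0.0 and 1.0")
    return 1.0 - fractional_identity


# Identical sequences have identity 1.0, hence distance 0.0.
identical = identity_to_distance(1.0)

# Sequences with 97% identity have a distance of approximately 0.03.
close_pair = identity_to_distance(0.97)
```

Because the result is a floating-point number, comparisons against a threshold are safer with a small tolerance than with exact equality.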

Distances between samples are the values of a beta diversity metric.

With sequences, distance matrices are often "sparse", meaning that distances are calculated for only a subset of pairs. Pairs with low identities (determined by a threshold and/or by word-counting heuristics) are omitted from the matrix, which can dramatically reduce the time and space required to compute and store a matrix for large sequence sets. Missing entries are assumed to be 1.0, i.e. the maximum possible distance (equivalently, the lowest possible identity). This often overstates the true distance, but in most situations pairs with low identities are effectively ignored, so whether the distance is, say, 0.5 or 1.0 makes no difference to the result of a given analysis. Using sparse matrices is therefore a useful optimization to reduce file sizes and execution times.
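The treatment of missing entries can be illustrated with a small lookup over a sparse matrix held in a dictionary. This is a sketch under stated assumptions: the names `pair_key`, `get_distance` and `MISSING_DISTANCE` are hypothetical, the example labels and values are made up, and storing each pair once under a sorted key is a design choice made here, not something the file format requires.

```python
from typing import Dict, Tuple

# Value assumed for any pair that is absent from a sparse matrix.
MISSING_DISTANCE = 1.0


def pair_key(a: str, b: str) -> Tuple[str, str]:
    """Return an order-independent key so (a, b) and (b, a) match."""
    return (a, b) if a <= b else (b, a)


def get_distance(sparse: Dict[Tuple[str, str], float], a: str, b: str) -> float:
    """Look up a distance, falling back to 1.0 for omitted pairs."""
    if a == b:
        return 0.0  # an object is at distance zero from itself
    return sparse.get(pair_key(a, b), MISSING_DISTANCE)


# Made-up example: only the two close pairs were stored.
sparse = {
    pair_key("SeqA", "SeqB"): 0.02,
    pair_key("SeqB", "SeqC"): 0.03,
}

stored = get_distance(sparse, "SeqB", "SeqA")   # stored pair, either order
omitted = get_distance(sparse, "SeqA", "SeqC")  # omitted pair, assumed 1.0
```

An analysis that only considers pairs below some distance cutoff sees the same input whether the omitted pair is recorded as 1.0 or as its true, larger-than-cutoff value.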

The matrix is stored as a tab-separated text file. Each line has three fields: Label1, Label2 and Distance. Pairs whose distance is unknown, or whose identity falls below the threshold, are omitted to save disk space. To simplify sequential parsing of this format, the file always starts with the diagonal, i.e. one line for each sequence giving a distance of zero from itself, e.g.

Label1 Label1 0.0 (identity) or Label1 Label1 0.0 (difference).

This ensures that the labels of all sequences are known before any off-diagonal entries are processed, which is convenient for code which reads a matrix in this format.
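A reader that takes advantage of the diagonal-first layout might look like the following sketch. It assumes three tab-separated fields per line with no header line, and it rejects any off-diagonal entry whose labels have not yet appeared on the diagonal. The function name `read_distmx` and the example labels are hypothetical, and the code does not assume whether a pair is written once or in both orders: each pair is stored under a single key either way.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_distmx(handle: TextIO) -> Tuple[List[str], Dict[Tuple[str, str], float]]:
    """Parse a three-column, tab-separated distance matrix in one pass."""
    labels: List[str] = []                          # labels in diagonal order
    index: Dict[str, int] = {}                      # label -> position in labels
    distances: Dict[Tuple[str, str], float] = {}    # off-diagonal entries only

    for line_no, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        label1, label2, value = fields[0], fields[1], float(fields[2])

        if label1 == label2:
            # Diagonal line: register the label the first time it is seen.
            if label1 not in index:
                index[label1] = len(labels)
                labels.append(label1)
            continue

        # Off-diagonal line: both labels must already be known.
        if label1 not in index or label2 not in index:
            raise ValueError(f"line {line_no}: label seen before its diagonal entry")
        if index[label1] <= index[label2]:
            key = (label1, label2)
        else:
            key = (label2, label1)
        distances[key] = value

    return labels, distances


# Made-up example file: three diagonal lines, then one stored pair.
example = "A\tA\t0.0\nB\tB\t0.0\nC\tC\t0.0\nB\tA\t0.02\n"
labels, distances = read_distmx(io.StringIO(example))
```

Because every label is known once the diagonal has been read, a caller that wants a dense array can allocate it at that point and fill it as the off-diagonal lines arrive, using 1.0 for pairs that never appear.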