UNCROSS algorithm

See also
  uncross command
  UNCROSS paper

Manual annotation of cross-talk
Cross-talk can be identified by examining an OTU table. If the lowest counts in a given OTU are much smaller than the counts in its high-abundance samples, the low counts are probably due to cross-talk and the true values should be zero.
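The manual check can be sketched in a few lines of Python. This is only an illustration of the idea, not the UNCROSS algorithm: the function name flag_suspect_counts and the 1% cutoff (max_ratio) are assumptions made for the example, and a suitable cutoff for real data would have to be judged by eye from the OTU table.

```python
def flag_suspect_counts(otu_counts, max_ratio=0.01):
    """Return the samples whose nonzero count is tiny relative to the
    largest count in the same OTU.

    otu_counts: dict mapping sample name -> read count for ONE OTU.
    max_ratio:  hypothetical cutoff (assumption, not taken from UNCROSS).
    """
    top = max(otu_counts.values(), default=0)
    if top == 0:
        return []
    # Zero counts are already "clean"; only small nonzero counts are suspect.
    return sorted(s for s, n in otu_counts.items() if 0 < n <= max_ratio * top)


# One row of an OTU table (made-up numbers for illustration).
row = {"A": 5000, "B": 3, "C": 0, "D": 1200, "E": 12}
suspects = flag_suspect_counts(row)
print(suspects)
```

With these made-up counts, samples B and E are flagged because their counts are below 1% of the largest count (5000), while the zero in sample C is left alone.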

You can clearly see cross-talk in this GAIIx example and this MiSeq example.

See the UNCROSS paper for a more formal and complete description of the algorithm.

In a mock sample, a high-abundance unexpected OTU, i.e. an OTU which does not match a species in the designed community, is probably a contaminant.

A low-abundance unexpected mock count is probably cross-talk if it is also present in another sample. An alternative explanation is a low-abundance contaminant in the mock sample which is a valid OTU in the environmental samples by coincidence; this is a much less likely explanation. Another possible explanation is contamination which affects multiple samples, e.g. flow-cell residue from previous runs; this is also considered to be less likely than cross-talk.

Under these assumptions, mock samples enable a more sensitive test for the presence of cross-talk. For example, if an unexpected mock OTU has two reads and some other sample has ten reads, then the most likely explanation is cross-talk. The anomalously large cross-talk rate of 2/12 = 17% of the reads can be explained by fluctuations due to sampling effects when the total number of reads is small, which can result in high outlier values for some OTUs. In environmental samples, OTUs cannot be considered as expected or unexpected, so abundances of two and ten in an OTU with twelve total reads are not a reliable indicator of cross-talk.
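The reasoning for mock samples can be written down as a small decision rule. The sketch below is an illustration under stated assumptions, not the UNCROSS-Ref implementation: the function names and the high_abundance cutoff of 100 reads are hypothetical, since the text does not say where "high abundance" begins.

```python
def classify_mock_count(count, expected, other_sample_counts, high_abundance=100):
    """Classify one OTU count observed in a mock sample.

    count:               reads for this OTU in the mock sample
    expected:            True if the OTU matches the designed mock community
    other_sample_counts: counts for the same OTU in the other samples
    high_abundance:      hypothetical cutoff (assumption for illustration)
    """
    if expected:
        return "expected"
    if count == 0:
        return "absent"
    if count >= high_abundance:
        return "probable contaminant"
    if any(n > 0 for n in other_sample_counts):
        return "probable cross-talk"
    return "undetermined"


def apparent_crosstalk_rate(mock_count, other_count):
    """Fraction of the OTU's reads that ended up in the mock sample."""
    return mock_count / (mock_count + other_count)


# The example from the text: two reads in the mock sample, ten elsewhere.
label = classify_mock_count(2, expected=False, other_sample_counts=[10])
rate = apparent_crosstalk_rate(2, 10)
print(label, f"{rate:.0%}")
```

The apparent rate of 2/12 is large only because the OTU has so few reads in total, which is why the same counts in an environmental sample would not be convincing evidence on their own.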

Automated annotation of cross-talk
The UNCROSS algorithm uses simple heuristics to automate the manual procedure described above for annotating cross-talk. UNCROSS-Ref predicts cross-talk in mock samples where OTUs are annotated as matching (or not matching) the designed mock community. UNCROSS-Denovo predicts cross-talk in all samples considering read counts alone. These approaches are complementary. UNCROSS-Ref can identify unexpected OTUs by comparison with the database and is thus more sensitive to cross-talk in OTUs with low overall abundance, but cannot detect or correct cross-talk in environmental samples. UNCROSS-Denovo is less sensitive to cross-talk in OTUs with low overall read counts, but can detect cross-talk in environmental samples and can thus be used to detect and correct cross-talk in practice.

UNCROSS is not a robust solution for cross-talk
The mechanism(s) causing cross-talk are not well understood. Many different indexing schemes are used. Cross-talk rates in your data may be quite different from the datasets on which UNCROSS was designed and tested, in which case the accuracy of UNCROSS on your data may be lower. Also, cross-talk may be hard or impossible to detect when the number of multiplexed samples is large, say around 100 or more. It is much better to use multiplexing strategies that are designed to reduce cross-talk. UNCROSS is best understood as a simplistic hack that is the best we can do with existing data.


Schematic illustration of the UNCROSS-Denovo algorithm. The OTU table entries for a given OTU are shown sorted by decreasing count (number of reads). If a count is at least 2% then it is classified as valid. If a count is <= 0.5% of the total over valid counts, it is predicted to be due to cross-talk. Intermediate values are classified as undetermined.
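The thresholds in the schematic can be sketched as follows. This is a rough illustration, not the code used by the uncross command. In particular, the caption does not say what the 2% is measured against; the sketch assumes it is 2% of the OTU's total read count, and the function name and label strings are made up for the example. See the UNCROSS paper for the actual definition.

```python
def uncross_denovo_sketch(counts, valid_frac=0.02, xt_frac=0.005):
    """Label each count of ONE OTU as valid, cross-talk or undetermined.

    counts:     read counts for one OTU, one entry per sample
    valid_frac: 2% threshold from the schematic; ASSUMED here to be a
                fraction of the OTU's total reads
    xt_frac:    0.5% threshold, as a fraction of the total over valid counts
    """
    total = sum(counts)
    # Pass 1: find the valid counts.
    is_valid = [n > 0 and n >= valid_frac * total for n in counts]
    valid_total = sum(n for n, v in zip(counts, is_valid) if v)

    # Pass 2: compare the remaining counts with the total over valid counts.
    labels = []
    for n, v in zip(counts, is_valid):
        if n == 0:
            labels.append("zero")
        elif v:
            labels.append("valid")
        elif n <= xt_frac * valid_total:
            labels.append("cross-talk")
        else:
            labels.append("undetermined")
    return labels


# Made-up counts for one OTU, sorted by decreasing count as in the schematic.
otu = [6000, 3000, 900, 60, 20, 5, 0]
labels = uncross_denovo_sketch(otu)
for n, lab in zip(otu, labels):
    print(n, lab)
```

In this made-up example the three largest counts are valid, 60 falls between the two thresholds and is undetermined, and 20 and 5 are predicted to be cross-talk.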