"non-redundant" (NR) database contains only one representative of a given type
of sequence. Dereplication removes identical sequences. Clustering at a lower
identity threshold, e.g. 90%, may further reduce the database size, enabling
faster searches with only a small loss in sensitivity.
See also database optimization.
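The two reduction steps described above can be illustrated with a minimal Python sketch. It is not the algorithm used by any particular tool: the helper names (dereplicate, identity, greedy_cluster) are hypothetical, and identity is assumed here to be the fraction of matching positions between equal-length, pre-aligned sequences, which sidesteps the alignment step a real clustering program would need.

```python
from collections import Counter


def dereplicate(seqs):
    """Collapse identical sequences into (sequence, count) pairs, most abundant first."""
    counts = Counter(seqs)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def identity(a, b):
    """Fraction of matching positions. Assumption: equal-length, pre-aligned sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(uniques, threshold):
    """Join each sequence to the first centroid at or above the threshold, else start a new cluster."""
    centroids = []  # each entry is [centroid_sequence, total_read_count]
    for seq, count in uniques:
        for entry in centroids:
            if identity(seq, entry[0]) >= threshold:
                entry[1] += count
                break
        else:
            # No existing centroid is similar enough, so this sequence becomes one.
            centroids.append([seq, count])
    return [(seq, count) for seq, count in centroids]


# Toy input: six reads, three distinct sequences.
reads = [
    "ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT",
    "TTTTGGGGCC", "TTTTGGGGCC", "ACGTACGTAC",
]
uniques = dereplicate(reads)              # three unique sequences remain
clusters = greedy_cluster(uniques, 0.90)  # the one-mismatch variant joins its neighbor
```

In this toy input, dereplication reduces six reads to three unique sequences, and clustering at 90% merges the single-mismatch variant into the most abundant sequence, leaving two representatives.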
In marker gene metagenomics, reads of marker sequences such as small-subunit
ribosomal RNA (16S, 18S), the internal transcribed spacer (ITS) and cytochrome
oxidase I (COI) are often clustered into groups called Operational Taxonomic
Units (OTUs), typically at a
97% identity threshold. At the time of writing, the UPARSE pipeline (Nature
Methods, Aug 2013) achieves the best throughput and the highest published
biological accuracy. The UCHIME algorithm
can be used for stand-alone chimera filtering in an OTU pipeline. UCHIME is
implemented in commands including uchime_ref.

Amplicon reads, e.g. from 16S marker genes, antibody or T-cell receptor (TCR)
immune system repertoire sequencing, can be used to estimate the biological
diversity represented in the amplicons.
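As a rough illustration of how diversity might be summarized once amplicon reads have been grouped into clusters, the Python sketch below computes two common measures from a list of cluster sizes. The choice of measures (richness and the Shannon index) is an assumption made for illustration, not something prescribed by the text above, and the function names and input values are hypothetical.

```python
import math


def richness(cluster_sizes):
    """Number of clusters that contain at least one read."""
    return sum(1 for n in cluster_sizes if n > 0)


def shannon_index(cluster_sizes):
    """Shannon index H = -sum(p * ln p) over clusters with non-zero abundance."""
    total = sum(cluster_sizes)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in cluster_sizes if n > 0)


# Hypothetical read counts per cluster (e.g. per OTU) for one sample.
cluster_sizes = [50, 30, 20]
print("richness:", richness(cluster_sizes))
print("Shannon index:", round(shannon_index(cluster_sizes), 3))
```

Richness only counts clusters, while the Shannon index also reflects how evenly reads are spread across them, so a sample dominated by one cluster scores lower than an even sample with the same number of clusters.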