Dereplication is the identification of unique sequences so that only one copy of each sequence is reported. USEARCH supports dereplication with the derep_fulllength and derep_prefix commands.
Different definitions of duplicate sequences are possible, as shown in the figure.
The full-length definition is the easiest to understand and also the easiest to implement in an algorithm: if two or more sequences are identical, all except one are kept, for example:
A = GATTACA
With prefixes, a sequence A is discarded if it is a prefix of some other sequence B in the set, for example:
A = GATTAC
With substrings, a sequence A is discarded if it is a substring of some other sequence B, for example:
A = -ATTAC-
All prefixes are substrings, and full-length matches are both prefixes and substrings. So:
substrings >= prefixes >= full length matches.
USEARCH supports full-length
and prefix dereplication, but currently not substring. In
next-generation sequencing applications, prefix dereplication is
useful for quality-trimmed reads, which are often truncated at
their right-hand end where quality scores tend to fall, but rarely
if ever at their left-hand ends. So far, I haven't come across a
case where substring dereplication is needed -- if you know of one,
please let me know.