Number of slots in the hash index. Should be set to a
prime number as large as possible given the amount of available memory (see
Interval between words in a database sequence that are
indexed. Default is 1. Increasing this value saves memory.
Minimum sequence length.
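The advice above to use a large prime for the number of hash index slots can be followed with a small helper. This is a minimal sketch, not part of the program: the function names are hypothetical, and the upper bound passed in is an assumption you must supply yourself, since the memory cost per slot is not stated here.

```python
def is_prime(n):
    """Trial-division primality test; adequate for slot counts up to ~10^12."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def largest_prime_at_most(n):
    """Return the largest prime <= n, or None if n < 2.

    n is a caller-supplied upper bound on the slot count (an assumption;
    this sketch does not derive it from available memory).
    """
    while n >= 2:
        if is_prime(n):
            return n
        n -= 1
    return None
```

For example, largest_prime_at_most(1000000) returns 999983, which could then be passed as the slot count.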
How to choose parameters
The -derep_subseq algorithm is based on
USEARCH, so it tests database sequences in order of decreasing number of
(indexed) words in common. Parameters should therefore be chosen to ensure
that there is at least one word in common between a substring and the
full-length sequence in the database. The longest possible word length
should be used since this reduces the number of false positives, i.e.
sequences with an identical word that do not match over the full length of
the shorter sequence. A single word in common is sufficient in this situation.
Suppose we tile the database with
non-overlapping 32-mers using -w 32 and -dbstep 32. Then a substring of
length 64 or more must have at least one matching 32-mer, as shown in the
As this example shows, if the
minimum sequence length is L and we use a word length w <= L/2 with -dbstep w,
then we have (1) minimized memory use (tiling the database indexes the fewest
possible words) and (2) guaranteed that a substring has at least one
word in common with the full-length sequence. Note that all words in the
query are considered; the -dbstep
option only applies to the database sequences. If w is also large enough
that random word matches are very unlikely, then this algorithm is usually
very effective, but strictly speaking it is still a heuristic unless we use
There is one small problem I
glossed over. If -dbstep 32 is used, the database sequence is not fully
covered unless its length is a multiple of 32. There is usually a fragment
at the end of length < 32, indicated by /// in the figure. Substrings that
match such fragments may be missed (false negatives). This can be
mitigated by using a -dbstep value that is smaller than w, say w/2, at the
cost of indexing more words.
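
The tiling argument and the end fragment can be checked with a small model. This is a sketch under stated assumptions, not the program's actual indexing code: it assumes the index holds every full-length word starting at positions 0, step, 2*step, ... of a database sequence, and that a substring is found whenever one indexed word lies entirely inside it. All function names are hypothetical, and suggest_tiling ignores any upper limit the program may place on word length.

```python
def indexed_starts(db_len, w, step):
    """Start positions of the full-length words indexed in one database sequence."""
    return list(range(0, db_len - w + 1, step))


def tail_length(db_len, w, step):
    """Number of trailing positions not covered by any indexed word."""
    starts = indexed_starts(db_len, w, step)
    if not starts:
        return db_len
    return db_len - (starts[-1] + w)


def shares_indexed_word(sub_start, sub_len, db_len, w, step):
    """True if some indexed word lies entirely inside the substring."""
    sub_end = sub_start + sub_len
    return any(sub_start <= s and s + w <= sub_end
               for s in indexed_starts(db_len, w, step))


def all_substrings_hit(db_len, sub_len, w, step):
    """True if every substring of length sub_len contains an indexed word."""
    return all(shares_indexed_word(p, sub_len, db_len, w, step)
               for p in range(db_len - sub_len + 1))


def suggest_tiling(min_seq_len):
    """Word length and step following the rule w <= L/2, step = w (a sketch)."""
    w = min_seq_len // 2
    return w, w
```

In this model, suggest_tiling(64) gives w = 32 and a step of 32, and all_substrings_hit(100, 64, 32, 32) is True, while substrings of length 40 are not all found with the same settings. For a database sequence of length 90, tail_length(90, 32, 32) is 26 and tail_length(90, 32, 16) is 10, which illustrates how a smaller step shrinks the uncovered fragment.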