The estimated memory required for a database index is:
Bytes = S + 4D + 8W.
Here, S is the size of the FASTA file containing the
sequences, D is the number of words or seeds in the database that are indexed,
and W is either the number of possible words or, if hashing is used
(-slots option), the number of slots.
If the database is very large, then this will be
dominated by the database size D. If the word length is long, then it will be
dominated by W = A^k,
where A is the size of the alphabet (4 for nucleotides, 20 for the standard
amino acid alphabet or less if a compressed alphabet is used), and k is the word
length (or effective word length, if a pattern is used).
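
As a rough illustration of the formula above, the following Python sketch evaluates S + 4D + 8W for a direct or hashed index. The function name, its parameters, and the example database (1 GB of nucleotides with about one indexed word per base) are assumptions made for this sketch, not part of USEARCH.

```python
def estimate_index_bytes(fasta_bytes, indexed_words, alphabet_size,
                         word_length, slots=None):
    """Estimate index memory in bytes as S + 4*D + 8*W."""
    # W is the number of slots for a hashed index,
    # otherwise the number of possible words, A**k.
    w = slots if slots is not None else alphabet_size ** word_length
    return fasta_bytes + 4 * indexed_words + 8 * w


# Hypothetical database: 1 GB FASTA file, one indexed word per base.
s = 1_000_000_000
d = 1_000_000_000
print(estimate_index_bytes(s, d, 4, 8))    # nucleotide 8-mers, direct index
print(estimate_index_bytes(s, d, 4, 16))   # nucleotide 16-mers, direct index
```

With 8-mers the 8W term is negligible and the estimate is dominated by S and D; with 16-mers the 8W term dominates.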
To significantly reduce the amount of memory needed, we
can use any combination of the following strategies.
Reduce the database size by clustering
If your database has a lot of redundancy, then it may be reasonable to
reduce its size by clustering. For example, 16S rRNA gene reference databases
often have many sequences that are 100% identical, or very close, especially if
the sequences are trimmed to sequencing primers (which is generally recommended).
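
The simplest case, removing sequences that are 100% identical over their full length, can be sketched in a few lines of Python. This is only an illustration of exact dereplication on (label, sequence) pairs already loaded into memory. It is not the clustering that USEARCH performs, and the function name is hypothetical.

```python
def dereplicate(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for label, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((label, seq))
    return unique


records = [("a", "ACGT"), ("b", "acgt"), ("c", "ACGA")]
print(dereplicate(records))
```

Near-identical sequences need a real clustering step; exact matching only removes the fully redundant entries.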
Reduce the database size by splitting
You can split the database into pieces and run the same query on each piece
separately (serially, i.e. one after the other, or in parallel, e.g. on a
cluster). A drawback of this strategy is that search and clustering time in
USEARCH often scales sub-linearly with database size. This means that if you
split a database into two pieces, searching each piece may take more than half
as long as searching the whole database, so the total work increases.
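
One possible way to form the pieces is a round-robin split, sketched below in Python. The function name is hypothetical, and the sketch assumes the records are already in memory. Each piece would then be written to its own FASTA file and searched separately.

```python
def split_records(records, pieces):
    """Distribute records round-robin into the given number of pieces."""
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    chunks = [[] for _ in range(pieces)]
    for i, record in enumerate(records):
        chunks[i % pieces].append(record)
    return chunks


print(split_records(["seq1", "seq2", "seq3", "seq4", "seq5"], 2))
```

Round-robin assignment keeps the pieces close to equal in record count. If sequence lengths vary widely, balancing by total bases may be a better choice.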
Reduce the number of indexed words or seeds
For high-identity search and clustering, we can index only a subset of
words/seeds in the database with only a small loss in sensitivity. This is the
strategy used by MEGABLAST, and a similar strategy can be used in USEARCH with
the -dbstep N option, which corresponds to the stride parameter of MEGABLAST.
This specifies that only every Nth word should be indexed. For example, with
-word_length 16 -dbstep 16, every 16th 16-mer is indexed and the database is
covered completely by non-overlapping words.
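
The effect of a step on the number of indexed words can be sketched as follows. This Python snippet is a toy model that assumes indexing starts at the first position of each sequence. The function name is hypothetical, and the way USEARCH chooses positions internally may differ.

```python
def indexed_words(seq, word_length, step=1):
    """Return (position, word) pairs for every step-th word in seq."""
    k = word_length
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]


seq = "ACGTACGTACGT"
print(len(indexed_words(seq, 4)))       # all overlapping 4-mers
print(indexed_words(seq, 4, step=4))    # non-overlapping 4-mers
```

When the step equals the word length, D shrinks by roughly a factor of the word length, which reduces the 4D term in the memory estimate by the same factor.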
With large databases and high-identity searches, it can be advantageous to
use longer word lengths. For example, we might use nucleotide 16mers instead of
the default 8mers. Then W = A^k = 4^16 = 4.3 x 10^9
and we need 34 GB just for the 8W term in the index. This is where hashing can
save a substantial amount of memory. Use of a hashed index is specified by the
-slots option, which sets the number of index slots. This is generally
chosen to be much smaller than the number of possible words. The number of slots
should ideally be (i) a prime number and (ii) large enough that hash collisions
are rare. For (i), the Prime Pages site has a handy page that you can use to
find a prime close to a desired number. Condition (ii) is trickier,
but in practice there's not much point in worrying about it because the amount
of available RAM is a stronger constraint. The rule of thumb is: use as many
slots as you can.
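
The rule of thumb can be turned into a small calculation: decide how many bytes to spend on the 8W term, divide by 8, and take the nearest prime at or below that value. The Python sketch below does this with simple trial division. The function names and the idea of a byte budget are assumptions made for illustration, and the result would be passed to -slots by hand.

```python
def is_prime(n):
    """Trial-division primality test, adequate for slot counts of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def prime_at_or_below(n):
    """Return the largest prime that is less than or equal to n."""
    while n >= 2 and not is_prime(n):
        n -= 1
    if n < 2:
        raise ValueError("no prime at or below the requested value")
    return n


def slots_for_budget(budget_bytes):
    """Largest prime slot count whose 8W term fits in budget_bytes."""
    return prime_at_or_below(budget_bytes // 8)


print(8 * 4 ** 16)                       # bytes for a direct 16-mer index
print(slots_for_budget(4_000_000_000))   # prime slot count for a 4 GB budget
```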
Reducing the maximum sequence length
In some cases, long sequences can cause excessive memory use in version 6.0.
This is being addressed, and will hopefully be improved in v6.1. In the
meantime, setting a lower value of -maxseqlength
may reduce memory requirements if the input data has long sequences, especially
if many threads are being run in parallel sharing the same memory space.