The estimated memory required for a database index is:
Bytes = S + 4D + 8W.
Here, S is the size of the FASTA file containing the sequences, D is the number of words or seeds in the database that are indexed, and W is either the number of possible words or, if hashing is used (-slots option), the number of slots.
If the database is very large, then the total will be dominated by the 4D term. If the word length is long, then it will be dominated by the 8W term, where W = A^k, A is the size of the alphabet (4 for nucleotides, 20 for the standard amino acid alphabet, or fewer if a compressed alphabet is used), and k is the word length (or effective word length, if a pattern is used).
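As a rough illustration of the formula above, the following Python sketch evaluates S + 4D + 8W for a given database. The function and parameter names are hypothetical and are not part of USEARCH; the sketch only restates the arithmetic.

```python
def estimate_index_bytes(fasta_bytes, indexed_words, alphabet_size=4,
                         word_length=8, slots=None):
    """Estimate index memory in bytes as S + 4*D + 8*W.

    fasta_bytes   -- S, size of the FASTA file in bytes
    indexed_words -- D, number of indexed words or seeds
    slots         -- if given, hashing is assumed and W is the slot count;
                     otherwise W = alphabet_size ** word_length
    """
    w = slots if slots is not None else alphabet_size ** word_length
    return fasta_bytes + 4 * indexed_words + 8 * w


# Example: a 1 GB nucleotide database with 10^9 indexed 8-mers.
print(estimate_index_bytes(10**9, 10**9, alphabet_size=4, word_length=8))
```

With the default 8-mers, the 8W term is only 8 x 4^8 bytes (about 0.5 MB), so the estimate is driven almost entirely by S and D.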
To significantly reduce the amount of memory needed, we can use any combination of the following strategies.
Reduce the database size by clustering
If your database has a lot of redundancy, then it may be reasonable to reduce its size by clustering. For example, 16S rRNA gene reference databases often have many sequences that are 100% identical, or very close, especially if the sequences are trimmed to the sequencing primers (which is generally recommended).
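The simplest case of this reduction, removing sequences that are 100% identical, can be sketched in a few lines of Python. This is only exact-duplicate removal, not clustering at a lower identity threshold, and the function name is hypothetical rather than a USEARCH command.

```python
def dereplicate(records):
    """Keep one (label, sequence) pair per distinct sequence.

    records is a list of (label, sequence) tuples. Comparison ignores
    case, and the first label seen for each sequence is retained.
    """
    first_label = {}
    for label, seq in records:
        key = seq.upper()
        if key not in first_label:
            first_label[key] = label
    # dicts preserve insertion order, so the output follows the input order
    return [(label, seq) for seq, label in first_label.items()]


records = [("r1", "ACGTACGT"), ("r2", "acgtacgt"), ("r3", "TTGACCAA")]
print(dereplicate(records))
```

Every duplicate removed lowers both S and D in the memory estimate.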
Reduce the database size by splitting
You can split the database into pieces and run the same query on each piece separately, either serially (one after the other) or in parallel (e.g. on a cluster). A drawback of this strategy is that search and clustering speed in USEARCH often has sub-linear scaling with database size. This means that if you split a database into two pieces, each piece may take more than half as long to search as the whole database, so the total search time increases.
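One way to do the splitting is sketched below in Python, assuming the database is a plain FASTA file small enough to parse in memory. Records are distributed round-robin so that the pieces end up with similar numbers of sequences. The helper names are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (label, sequence) tuples."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


def split_records(records, n_pieces):
    """Distribute records round-robin over n_pieces lists."""
    pieces = [[] for _ in range(n_pieces)]
    for i, rec in enumerate(records):
        pieces[i % n_pieces].append(rec)
    return pieces


fasta = ">a\nACGT\n>b\nGG\nTT\n>c\nAAAA\n"
for piece in split_records(read_fasta(fasta), 2):
    print(piece)
```

Each piece would then be written to its own FASTA file and searched separately, and the hits from all pieces combined afterwards.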
Reduce the number of indexed words or seeds
For high-identity search and clustering, we can index only a subset of words/seeds in the database with only a small loss in sensitivity. This is the strategy used by MEGABLAST, and a similar strategy can be used in USEARCH with the -dbstep N option which corresponds to the stride parameter of MEGABLAST. This specifies that only every Nth word should be indexed. So for example with -word_length 16 -dbstep 16, every 16th 16mer will be indexed and the database will be covered completely with non-overlapping words.
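The effect of the stride on the number of indexed words, D, can be illustrated with a short Python sketch. It assumes that indexing starts at the first position of each sequence and steps forward by N. That is an assumption made for illustration, not a description of how USEARCH implements -dbstep, and the function name is hypothetical.

```python
def indexed_words(seq, word_length, dbstep=1):
    """Return (position, word) pairs for every dbstep-th word of seq."""
    last_start = len(seq) - word_length
    return [(i, seq[i:i + word_length])
            for i in range(0, last_start + 1, dbstep)]


seq = "ACGTACGTAC"
print(len(indexed_words(seq, 4, dbstep=1)))  # every overlapping 4-mer
print(indexed_words(seq, 4, dbstep=4))       # non-overlapping 4-mers
```

With a stride equal to the word length, D falls by roughly a factor of N, which reduces the 4D term of the memory estimate accordingly.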
With large databases and high-identity searches, it can be advantageous to use longer word lengths. For example, we might use nucleotide 16mers instead of the default 8mers. Then W = A^k = 4^16 = 4.3 x 10^9, and we need about 34 GB just for the 8W term in the index. This is where hashing can save a substantial amount of memory. Use of a hashed index is specified by the -slots option, which sets the number of index slots. This is generally chosen to be much smaller than the number of different possible words. The number of slots should ideally be (i) a prime number and (ii) large enough that hash collisions are rare. For (i), the Prime Pages site has a handy page that you can use to find a prime close to a desired number. Condition (ii) is trickier, but in practice there's not much point in worrying about it because the amount of available RAM is a stronger constraint. The rule of thumb is: use as many slots as you can.
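The arithmetic and the choice of a prime slot count can be sketched in Python as follows. The hash shown here (a base-4 encoding of the word, taken modulo the number of slots) is an illustrative assumption and is not necessarily the hash function USEARCH uses. All function names are hypothetical.

```python
def is_prime(n):
    """Trial-division primality test, adequate for slot counts up to ~10^12."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def largest_prime_at_most(n):
    """Return the largest prime <= n, e.g. to fit a memory budget."""
    if n < 2:
        raise ValueError("no prime <= %d" % n)
    while not is_prime(n):
        n -= 1
    return n


CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def word_to_slot(word, slots):
    """Map a nucleotide word to a slot: base-4 value modulo slots."""
    value = 0
    for ch in word:
        value = value * 4 + CODES[ch]
    return value % slots


full_table_bytes = 8 * 4**16       # 8W with one entry per possible 16-mer
budget_bytes = 4 * 10**9           # hypothetical 4 GB budget for the 8W term
slots = largest_prime_at_most(budget_bytes // 8)
print(full_table_bytes, slots, word_to_slot("ACGTACGTACGTACGT", slots))
```

The resulting value of slots would be passed to the -slots option. Distinct words that map to the same slot are hash collisions, which is why more slots are better when memory allows.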