Low-complexity sequences are simple repeats such as ATATATATAT or regions that are highly enriched for just one letter, e.g. AAACAAAAAAAAGAAAAAAC. Protein segments composed of only a few distinct amino acids are also considered to be low complexity, e.g. PPCDPPPPPKDKKKKDDGPP. Such a segment could align with a high score to another region with many Ps and Ks, but that would not necessarily indicate an evolutionary relationship.
Repetitive and low-complexity sequences cause problems for search and clustering algorithms based on matching words or patterns. Low-complexity sequences cause certain words to have high frequencies, which can cause performance problems if they are not masked. For example, words that are mostly or entirely composed of a single letter, such as AAAAAA or TTTTCTTT, often have high frequencies. For UBLAST, most of these words would be false positives if used as alignment seeds. For USEARCH, they are expensive to count and degrade the correlation between word count and sequence identity.
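The effect on word frequencies can be illustrated with a short Python sketch. It is not USEARCH code: the helper name word_counts and the word length of 4 are assumptions chosen for illustration. It counts overlapping words in the low-complexity nucleotide example given above.

```python
from collections import Counter

def word_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Low-complexity example from the text; k = 4 is an arbitrary illustrative choice.
low = "AAACAAAAAAAAGAAAAAAC"
counts = word_counts(low, 4)

# The most frequent words dominate the total, which is what makes them
# poor seeds and expensive to count.
for word, n in counts.most_common(3):
    print(word, n)
```

In a sequence like this one, a single word (AAAA) accounts for a large share of all words, so a match on that word says little about overall similarity.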
Soft and hard masking
Soft masking indicates masked regions by using lower-case letters. Hard masking (-hardmask option) overwrites masked regions with a wildcard letter, N for nucleotides or X for proteins.
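The difference between the two representations can be sketched in a few lines of Python. The functions soft_mask and hard_mask are hypothetical helpers written for this example, and the masked interval is supplied by hand rather than detected by a masking algorithm.

```python
def soft_mask(seq, start, end):
    """Return seq with seq[start:end] in lower case and the rest in upper case."""
    return seq[:start].upper() + seq[start:end].lower() + seq[end:].upper()

def hard_mask(soft_masked, wildcard="N"):
    """Replace every lower-case (soft-masked) letter with the wildcard letter."""
    return "".join(wildcard if c.islower() else c for c in soft_masked)

seq = "ACGT" + "A" * 7 + "CGT"
soft = soft_mask(seq, 4, 11)       # poly-A run becomes lower case
hard = hard_mask(soft)             # use wildcard="X" for proteins
print(soft)
print(hard)
```

Soft masking is reversible because the original letters are kept, while hard masking discards them.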
Masking excludes words and seeds
In USEARCH, masking is used for only one purpose: excluding seeds or word matches. When an index is built, a word or seed is not indexed if it contains one or more masked letters. Similarly, a word or seed in the query sequence is not considered if it contains any masked letters.
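A minimal Python sketch of this exclusion rule follows. It assumes soft masking (lower-case letters mark masked positions) and a fixed word length; the function name unmasked_words is hypothetical and this is not how USEARCH is implemented internally.

```python
def unmasked_words(seq, k):
    """Yield (position, word) for each word of length k with no masked letter.

    A letter is treated as masked if it is lower case (soft masking).
    """
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if not any(c.islower() for c in word):
            yield i, word

# Words overlapping the masked run "aaaa" are skipped, even if they
# overlap it by a single letter.
for pos, word in unmasked_words("ACGTaaaaCGTA", 4):
    print(pos, word)
```

Note that a masked run of length n removes more than n words from consideration, because every word that touches the run is dropped.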
Masked regions are included in the alignment score
An alignment will not be initiated in a masked region (because seeds are excluded), but it may extend through a masked region. In USEARCH, soft-masked regions are always included in the score. Hard masking can be used to exclude them from the score, because a wildcard letter has zero substitution score against all letters.
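The scoring consequence can be shown with a toy ungapped score in Python. The match and mismatch values (+1 and -1) and the function ungapped_score are assumptions made for this illustration; they are not USEARCH's scoring parameters. The only property taken from the text is that a wildcard scores zero against every letter.

```python
def ungapped_score(a, b, match=1, mismatch=-1, wildcard="N"):
    """Score two equal-length sequences position by position.

    Case is ignored, so soft-masked (lower-case) letters score normally.
    A wildcard in either sequence contributes zero.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    score = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == wildcard or y == wildcard:
            continue  # zero substitution score against all letters
        score += match if x == y else mismatch
    return score

target = "ACGT" + "A" * 6 + "ACGT"
soft_query = "ACGT" + "a" * 6 + "ACGT"   # masked run still counts
hard_query = "ACGT" + "N" * 6 + "ACGT"   # masked run contributes nothing
print(ungapped_score(soft_query, target))
print(ungapped_score(hard_query, target))
```

With soft masking the poly-A run inflates the score; with hard masking only the flanking letters contribute.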
USEARCH supports four masking algorithms as shown in the table.
|Method|Sequence type|Description|
|fastamino|protein|Unpublished method. Default for proteins.|
|fastnucleo|nucleotide|Unpublished method. Default for nucleotides.|
|seg|protein|Entropy-based method as used by BLASTP.|
|dust|nucleotide|Ad-hoc method as used by BLASTN.|
The fastamino and fastnucleo methods were developed because the seg and dust methods used by BLAST are slow enough to have a significant impact on search times with the faster algorithms used by USEARCH. These masking methods emphasize detection of simple repeats and tend to mask less than dust and seg. In my experience, they are effective for most applications where USEARCH is commonly used.
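To give a rough idea of what "entropy-based" means for a method such as seg, the Python sketch below computes the Shannon entropy of the letter composition of a segment. This is only the underlying quantity, not the seg algorithm itself, and the function name shannon_entropy is an assumption made for this example.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy, in bits per letter, of the letter composition of window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())

# Low-complexity protein example from the text versus a segment in which
# every letter is different.
print(shannon_entropy("PPCDPPPPPKDKKKKDDGPP"))
print(shannon_entropy("ACDEFGHIKLMNPQRSTVWY"))
```

A segment dominated by a few letters has low entropy, so an entropy threshold over sliding windows is one plausible way to flag candidate regions for masking.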