Sequence database files

Search commands require a -db option specifying a database filename. For u-commands (usearch_global, usearch_local and ublast) the database may be in FASTA format or UDB format. Other search commands (search_local and search_global) support FASTA only. The file format is automatically detected, so the -db option is used for both file types.
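The format rules above can be written down as a small lookup. This is only an illustrative sketch: the helper name db_format_supported and the "fasta"/"udb" labels are hypothetical and not part of the program, which detects the format itself from the file passed to -db.

```python
# Commands named in the text, grouped by the database formats they accept.
U_COMMANDS = {"usearch_global", "usearch_local", "ublast"}   # FASTA or UDB
FASTA_ONLY_COMMANDS = {"search_local", "search_global"}      # FASTA only


def db_format_supported(command: str, db_format: str) -> bool:
    """Return True if `command` accepts a database in `db_format`.

    `db_format` is "fasta" or "udb". Hypothetical helper for illustration;
    the real program auto-detects the format of the -db file.
    """
    fmt = db_format.lower()
    if fmt not in {"fasta", "udb"}:
        raise ValueError(f"unknown database format: {db_format!r}")
    if command in U_COMMANDS:
        return True
    if command in FASTA_ONLY_COMMANDS:
        return fmt == "fasta"
    raise ValueError(f"not a search command covered here: {command!r}")
```

For example, db_format_supported("ublast", "udb") is True, while db_format_supported("search_global", "udb") is False.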

Indexed databases
The u-commands are designed to optimize search speed for large datasets. A key technique is an index on the database that supports rapid retrieval of word counts or seeds. The index can be built in memory on the fly from a FASTA file, or it can be pre-built and stored in a UDB file. Using FASTA can be convenient, but with a large database, load times are longer and more memory is required than with a UDB file. The memory required to hold a UDB file is approximately the same as the UDB file size. When an index is created on the fly from a FASTA file, indexing options can be specified on the search command line. This also applies when a centroid database is constructed on the fly during clustering.
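To make the idea of a word index concrete, here is a minimal Python sketch of building one in memory from FASTA text and using it to count the words a query shares with each database sequence. It illustrates the general technique only and is not the program's actual index or UDB layout; the function names and the word length of 8 are assumptions chosen for the example.

```python
from collections import Counter, defaultdict


def parse_fasta(text):
    """Parse FASTA text into a list of (label, sequence) pairs."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


def build_word_index(records, word_length=8):
    """Map each word to the set of database sequence numbers containing it."""
    index = defaultdict(set)
    for seq_no, (_, seq) in enumerate(records):
        for i in range(len(seq) - word_length + 1):
            index[seq[i:i + word_length]].add(seq_no)
    return index


def count_shared_words(query, index, word_length=8):
    """Count, per database sequence, the distinct query words it contains."""
    counts = Counter()
    query_words = {query[i:i + word_length]
                   for i in range(len(query) - word_length + 1)}
    for word in query_words:
        # One dictionary lookup per word; no scan over the database.
        for seq_no in index.get(word, ()):
            counts[seq_no] += 1
    return counts
```

The index holds an entry for every word in the database in addition to the sequences themselves, which is consistent with an indexed database needing more memory than the plain FASTA data. Building it once and storing it corresponds to the pre-built case; building it at load time corresponds to the on-the-fly case.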

Non-indexed databases
The search_local and search_global commands do not use an index. The FASTA database file is loaded into memory as-is, so the memory required is approximately the same as the FASTA file size. Using these commands saves memory and can be convenient for small datasets, but searches are usually slower.
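For contrast with an indexed search, a non-indexed search can be pictured as a plain loop over every database sequence. The sketch below is an assumption-laden illustration, not the method used by search_local or search_global: the function name scan_database and the 0.8 cutoff are hypothetical, and Python's difflib similarity ratio stands in for a real alignment score.

```python
from difflib import SequenceMatcher


def scan_database(query, records, min_similarity=0.8):
    """Compare the query with every database sequence, one after another.

    `records` is a list of (label, sequence) pairs held in memory as-is;
    no index is built, so the work grows with the size of the database.
    """
    hits = []
    for label, seq in records:
        # Stand-in similarity score, not an alignment identity.
        score = SequenceMatcher(None, query, seq, autojunk=False).ratio()
        if score >= min_similarity:
            hits.append((label, score))
    # Best hits first.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits
```

Nothing is stored beyond the sequences themselves, but every query pays for a full pass over the database, which is why this style of search suits small datasets.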

Command                                DB format     Indexed?  Time    File size and RAM use
usearch_global, usearch_local, ublast  FASTA or UDB  Yes       Faster  Larger, typically a few times the FASTA file size
search_local, search_global            FASTA         No        Slower  Smaller, typically about the FASTA file size