Mapping reads to OTUs

See also
  OTU clustering
  UPARSE commands
  cluster_otus command
  uc2otutab.py Python script

Output from cluster_otus is a FASTA file containing OTU representative sequences. Downstream analysis often requires an "OTU table", a matrix that gives the number of reads from each sample that are assigned to each OTU. To create an OTU table, the first step is to map reads to OTUs. This can be done by searching the reads as a query set against the OTU representative sequences as the database. You can then parse the output file to create an OTU table.

The sample identifier is provided by a field in this format in the label:

barcodelabel=SAMPLE_ID;
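
As an illustration of how this annotation can be read back out of a label, here is a minimal Python sketch. The function name get_sample_id and the example label are hypothetical; only the barcodelabel=SAMPLE_ID; format comes from the text above.

```python
import re

# Matches the sample annotation described above: barcodelabel=SAMPLE_ID;
_BARCODE_RE = re.compile(r"barcodelabel=([^;]+);")


def get_sample_id(label):
    """Return the sample identifier from a read label, or None if absent."""
    match = _BARCODE_RE.search(label)
    return match.group(1) if match else None


# Hypothetical read label used only for illustration.
print(get_sample_id("read_17;barcodelabel=Mock1;"))  # prints Mock1
```

A label without the annotation returns None, which makes it easy to spot reads that were not relabeled.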

This annotation can be added to the label using a script appropriate for your reads, e.g. fastq_strip_barcode_relabel.py.

Reads should be converted to FASTA format before performing the search. This can be done using the fastq_filter command.

OTU sequences should be re-labeled with convenient labels for parsing, e.g. OTU_1, OTU_2 ... OTU_N where N is the number of OTUs. This can be done with the fasta_number.py script:

python fasta_number.py otus.fa OTU_ > otusn.fa
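
For readers who want to see what this relabeling step amounts to, the following Python sketch replaces each FASTA header with a prefix plus a running number. It is not the source of fasta_number.py, only an approximation of the behavior described above; the function name number_fasta and the example sequences are assumptions.

```python
import io


def number_fasta(lines, prefix="OTU_"):
    """Yield FASTA lines with each header replaced by prefix + running number."""
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # New record: discard the old label and emit OTU_1, OTU_2, ...
            count += 1
            yield ">" + prefix + str(count)
        elif line:
            # Sequence line: pass through unchanged.
            yield line


# Hypothetical two-record input used only for illustration.
example = io.StringIO(">seqA;size=120;\nACGT\nTTGA\n>seqB;size=80;\nGGCC\n")
for out in number_fasta(example):
    print(out)
```

The original labels are dropped entirely in this sketch, so keep the unnumbered file if you need annotations such as size= later.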

Then you can run the search using the usearch_global command:

usearch -usearch_global reads.fa -db otusn.fa -strand plus -id 0.97 -uc readmap.uc

The uc output file format is recommended because it is the only USEARCH output file format that reports query sequences that did not match the database. You can use grep to find how many reads did not match by counting the lines that start with N (no-hit records):

grep -c "^N" readmap.uc
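
The parsing step can be sketched in Python as follows. This is not the uc2otutab.py script, only a minimal illustration under stated assumptions: the uc file is tab-separated with ten fields, the record type is in the first field, the query (read) label is in the ninth and the target (OTU) label is in the tenth, and hits are H records. The function names and the example records are hypothetical.

```python
import re
from collections import defaultdict

_BARCODE_RE = re.compile(r"barcodelabel=([^;]+);")


def build_otu_table(uc_lines):
    """Count hits per (OTU, sample) and count no-hit reads from uc records."""
    counts = defaultdict(lambda: defaultdict(int))
    unmatched = 0
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        record_type, query_label, target_label = fields[0], fields[8], fields[9]
        if record_type == "N":
            unmatched += 1  # read did not match any OTU
        elif record_type == "H":
            match = _BARCODE_RE.search(query_label)
            sample = match.group(1) if match else "unknown"
            counts[target_label][sample] += 1
    return counts, unmatched


def format_otu_table(counts):
    """Return a tab-separated table: one row per OTU, one column per sample."""
    samples = sorted({s for per_otu in counts.values() for s in per_otu})
    rows = ["\t".join(["OTUId"] + samples)]
    for otu in sorted(counts):
        cells = [str(counts[otu].get(s, 0)) for s in samples]
        rows.append("\t".join([otu] + cells))
    return "\n".join(rows)


# Hypothetical uc records used only for illustration.
example = [
    "H\t0\t250\t99.2\t+\t0\t0\t250M\tr1;barcodelabel=A;\tOTU_1",
    "H\t1\t250\t98.0\t+\t0\t0\t250M\tr2;barcodelabel=B;\tOTU_2",
    "H\t0\t250\t97.6\t+\t0\t0\t250M\tr3;barcodelabel=A;\tOTU_1",
    "N\t*\t250\t*\t*\t*\t*\t*\tr4;barcodelabel=B;\t*",
]
table, unmatched = build_otu_table(example)
print(format_otu_table(table))
print("unmatched reads:", unmatched)
```

Rows are sorted as plain strings in this sketch, so OTU_10 sorts before OTU_2; a numeric sort key would be needed for natural ordering.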

Note that a given read may match two or more OTUs at the given identity threshold. In such cases, usearch_global will tend to assign the read to the OTU with highest identity and will break ties arbitrarily.

Some reads may not match any OTU, for these reasons: (1) the read is chimeric, (2) the read has more than 3% errors, so it falls below the 97% identity threshold, or (3) the read has a singleton sequence, which was discarded.