USEARCH manual

Mapping reads to OTUs

Output from cluster_otus is a FASTA file containing OTU representative sequences. Downstream analysis often requires an "OTU table", a matrix that gives the number of reads from each sample that is assigned to each OTU. To create an OTU table, the first step is to map reads to OTUs. This can be done by searching the reads as a query set against the OTU representative sequences as the database. Then you can parse the output file to create a OTU table.

The input to the mapping step should be reads after any applicable paired read merging and length truncation, but before quality filtering or discarding singletons. A singleton or low-quality read that aligns to an OTU centroid sequence with >=97% identity can & should be included in that OTU. The reasoning for this is that errors are very unlikely to increase sequence similarity by chance, so we can be confident that the correct biological sequence for that read has identity with the OTU that is at least as great as the identity we get from the alignment. Mapping recovers many or most of the reads that were discarded before the cluster_otus step and gives a better estimate of the OTU abundance in terms of the number of reads that belong in the OTU.

If you want to convert the output from the mapping to an OTU table, then reads must be annotated with sample names. I use the following convention: the sample identifier is provided by a field in this format in the query sequence label:

barcodelabel=SAMPLE_ID;

This annotation can be added to the label using a script appropriate for your reads, e.g. fastq_strip_barcode_relabel.py. The name "barcodelabel=" is confusing. It would be better to use something more self-explanatory like "samplid=", but unfortunately I'm stuck with barcodelabel= for backwards compatibility.

I also recommend that the OTU sequences have convenient labels for parsing, e.g. OTU_1, OTU_2 ... OTU_N where N is the number of OTUs. This can be done using the -relabel option of cluster_otus.

To do the actual mapping, you can run the search using the usearch_global command, for example:

usearch -usearch_global reads.fastq -db otusn.fa -strand plus -id 0.97 -uc readmap.uc \
-maxaccepts 8 -maxrejects 64 -top_hit_only

The uc output file format is suggested as this is the only USEARCH output file format that reports query sequences that did not match the database. You can use grep to find how many reads did not match by counting the lines that start with an N (no-hit records):

grep -c "^N" readmap.uc

Note that a given read may match two or more OTUs at the given identity threshold. In such cases, usearch_global will tend to assign the read to the OTU with highest identity and will break ties arbitrarily by default. You can mitigate these issues and get more consistent OTU assignments by using -top_hit_only and a larger value of maxaccepts, as shown in the above example. The top_hit_only option breaks ties consistently by choosing the first target sequence in database order.

Some reads may not match any OTU for these reasons: (1) the read is chimeric, (2) the read has more than 3% errors, (3) the read has a singleton sequence so was discarded.