Sample identifiers in read labels

Making an OTU table
An OTU table is made by the otutab command. The query set, i.e. the FASTA file or FASTQ file containing the reads, must have sample identifiers in the labels.

Why different ways to do it?
Usearch supports several ways of putting sample names into sequence labels. This provides some backwards compatibility with earlier versions and allows flexibility in the format of sample names, which were probably chosen without a particular software package in mind. For example, QIIME does not allow an underscore in the sample identifier, which is too restrictive in my opinion.

How to check that your sample names are formatted correctly
Use the fastx_get_sample_names command.

Sample identifier syntax
The sample name can be specified by putting sample=xxx; into the label. The semi-colon marks the end of the sample identifier, so semi-colons are not allowed in the identifier, but any other character may be used. If sample= is not found, the sample identifier is assumed to start at the beginning of the label and continue up to the first character which is not alphanumeric or an underscore, unless the sample_delim option is specified (see below). Put another way, any character which is not a letter, number or underscore marks the end of the sample identifier. The following labels have sample identifier S01. FASTA labels start with > at the beginning of the line; FASTQ labels start with @.


In the first and second examples, the period (.) is the first character which is not a letter, number or underscore, so the .123 is not part of the sample identifier.
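The rules above can be sketched in Python as follows. This is an illustrative reimplementation, not USEARCH's own code. The function name sample_from_label and the labels in the usage note are made up for illustration, and two details are assumptions: letters and digits are taken to be ASCII, and a sample= annotation that runs to the end of the label without a closing semi-colon is still accepted.

```python
import re


def sample_from_label(label: str) -> str:
    """Return the sample identifier of a read label (illustrative sketch)."""
    # Drop the FASTA '>' or FASTQ '@' marker if the caller left it in place.
    if label[:1] in (">", "@"):
        label = label[1:]

    # Rule 1: an explicit sample=xxx; annotation anywhere in the label.
    # Assumption: end of label also terminates the identifier.
    annotated = re.search(r"sample=([^;]*)", label)
    if annotated:
        return annotated.group(1)

    # Rule 2: otherwise take the leading run of letters, digits and
    # underscores; the first other character ends the identifier.
    # Assumption: "alphanumeric" means ASCII letters and digits.
    return re.match(r"[A-Za-z0-9_]*", label).group(0)
```

With this sketch, the hypothetical labels >S01.123, @S01.123 and >read7;sample=S01; all map to S01, while S01_123 maps to S01_123 because the underscore counts as part of the identifier by default.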

The -sample_delim option
This option specifies a string of one or more characters that marks the end of a sample identifier. If this option is used, the sample identifier must begin with the first character in the label and continue until the first match to the delimiter string. For example, if you have reads that were processed with QIIME, then read labels start with the sample identifier, which is followed by an underscore (_) and an integer read number. Input in this format can be processed like this:

usearch -otutab qiime_reads.fq -sample_delim _ -otutabout otutable.txt
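To make the delimiter rule concrete, here is a minimal Python sketch of how a label would be split when a delimiter string is given. It is not USEARCH's implementation. The function name sample_from_delim is hypothetical, and raising an error when the delimiter is absent is an assumption, since the text does not say what happens in that case.

```python
def sample_from_delim(label: str, delim: str) -> str:
    """Return the part of the label before the first match to delim."""
    # Drop the FASTA '>' or FASTQ '@' marker if present.
    if label[:1] in (">", "@"):
        label = label[1:]

    # Split at the first occurrence of the delimiter string.
    head, found, _rest = label.partition(delim)
    if not found:
        # Assumption: a label without the delimiter is treated as an error.
        raise ValueError(f"delimiter {delim!r} not found in label {label!r}")
    return head
```

For a QIIME-style label such as @S01_123 and the delimiter _, this returns S01. Unlike the default rule, characters such as periods are kept when they appear before the delimiter.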

How to get sample names into your labels
The simplest method is to use the fastx_relabel command or the -relabel option of fastq_mergepairs, fastq_filter or fastx_uniques. If you process one file at a time, you can do something like this:

usearch -fastx_uniques reads.fastq -relabel SampleName. -fastaout uniques.fa

Note the period following SampleName.

If -relabel @ is specified, the sample name is constructed from the FASTQ filename by truncating at the first underscore or period. With typical Illumina FASTQ filenames, this gives the sample name.
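The filename truncation can be sketched in Python as below. This only illustrates the rule as stated. The function name sample_from_filename is hypothetical, and stripping any leading directory before truncating is an assumption.

```python
import os
import re


def sample_from_filename(path: str) -> str:
    """Truncate a FASTQ filename at its first underscore or period."""
    # Assumption: only the filename is used, not the directory part.
    name = os.path.basename(path)
    # Keep everything before the first '_' or '.', whichever comes first.
    return re.split(r"[_.]", name, maxsplit=1)[0]
```

For a typical Illumina-style name such as S01_S1_L001_R1_001.fastq, this yields S01. Note that a sample name which itself contains an underscore or period would be cut short by this rule.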

Alternatively, you could write your own script to do this task.
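As a starting point for such a script, here is a minimal Python sketch that relabels FASTQ records with a sample name. It assumes strict four-line FASTQ records with no blank lines, and it assumes the new label should be the sample name, a period, and a running read number (for example S01.1, S01.2). The function name relabel_fastq is made up for illustration.

```python
from typing import Iterable, Iterator


def relabel_fastq(lines: Iterable[str], sample: str) -> Iterator[str]:
    """Yield FASTQ lines with each label replaced by '@<sample>.<n>'."""
    count = 0
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if index % 4 == 0:
            # First line of each four-line record is the label.
            if not line.startswith("@"):
                raise ValueError(f"expected a FASTQ label, got {line!r}")
            count += 1
            yield f"@{sample}.{count}"
        else:
            # Sequence, '+' separator and quality lines pass through unchanged.
            yield line
```

To use it on a file, open the input, pass the file object and the sample name to relabel_fastq, and write each yielded line followed by a newline to the output. The period after the sample name ensures the read number is not taken as part of the sample identifier under the default rule.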