usearch manual


Read preparation

See also
 
  UPARSE home page
  UPARSE pipeline home page
  OTU benchmark results
  UPARSE algorithm
  Tutorial examples

Overview
Reads must be processed before clustering and mapping to OTUs. Exactly which steps are needed depends on which library preparation, PCR and sequencing methods were used.

Paired read assembly
If you have overlapping paired reads, then the first step is to assemble them using fastq_mergepairs.

Non-biological sequence should be removed
It's important to understand the structure of your reads, e.g. where barcodes, primers and adapters might appear. If some or all of the reads contain non-biological sequence, then those bases should be stripped before clustering. Segments which match primers should also be stripped, because PCR tends to substitute mismatched positions with bases complementary to the primer, so those positions reflect the primer rather than the template. You can strip primer-binding segments by using the -stripleft and -stripright options of fastx_truncate.
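To make the stripping step concrete, here is a minimal Python sketch of removing a fixed number of bases from each end of a read. It only illustrates the idea behind -stripleft and -stripright and is not the usearch implementation; the function name strip_ends, the example read and the strip lengths of 4 are hypothetical.

```python
def strip_ends(seq, qual, left, right):
    """Remove `left` bases from the start and `right` bases from the end.

    Illustrative helper (hypothetical name). The quality string is trimmed
    in step with the sequence so the two stay aligned.
    """
    if left < 0 or right < 0:
        raise ValueError("strip lengths must be non-negative")
    end = len(seq) - right
    if end <= left:
        # Nothing would remain after stripping both ends.
        return "", ""
    return seq[left:end], qual[left:end]


# Made-up read: 4-base segments at each end stand in for primer-binding regions.
read_seq = "ACGTACGTTTGGCCAAGGTT"
read_qual = "I" * len(read_seq)
trimmed_seq, trimmed_qual = strip_ends(read_seq, read_qual, left=4, right=4)
print(trimmed_seq)  # ACGTTTGGCCAA
```

In practice the strip lengths would be the lengths of your forward and reverse primers, which depend on your protocol.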

Quality filtering
Generating high-quality OTUs requires that a high-quality subset of the reads is selected before running the cluster_otus command. Otherwise, many spurious OTUs may be generated by reads with too much sequencing error. The high-quality subset may be a relatively small fraction of the reads, but most of the discarded reads are not lost because most of them will map to OTUs when the OTU table is constructed.

Quality filtering of FASTQ reads is done by the fastq_filter command. Use the -fastq_maxee option to set an expected error threshold; 1.0 is recommended as a default. Don't worry if this filter appears to be stringent; you will probably find that most of the discarded reads map to OTUs after clustering.
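The expected number of errors in a read is the sum of the per-base error probabilities implied by its quality scores, where a Phred score Q corresponds to an error probability of 10^(-Q/10). The Python sketch below shows that calculation and a threshold test like the one described above. It assumes Phred+33 ASCII encoding, and the function names are hypothetical.

```python
def expected_errors(qual, offset=33):
    """Sum of per-base error probabilities for a FASTQ quality string.

    Assumes Phred scores encoded as ASCII with the given offset
    (33 is an assumption here; some older data uses 64).
    """
    return sum(10 ** (-(ord(c) - offset) / 10) for c in qual)


def passes_maxee(qual, maxee=1.0, offset=33):
    """True if the read has at most `maxee` expected errors."""
    return expected_errors(qual, offset) <= maxee


high_quality = "I" * 100   # 'I' is Q40, error probability 0.0001 per base
low_quality = "+" * 20     # '+' is Q10, error probability 0.1 per base

print(passes_maxee(high_quality))  # True  (about 0.01 expected errors)
print(passes_maxee(low_quality))   # False (about 2 expected errors)
```

Note that a long read with uniformly good scores can pass while a short read with poor scores fails, because the threshold applies to the total over the read, not to the average quality.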

Discarding singletons
Discarding singletons is an effective error filter because most singleton reads have at least one base call error. It is therefore recommended to use the -minsize 2 option of cluster_otus, which ignores singletons. As with low quality reads, most singletons will map to an OTU when the OTU table is constructed, so the data is not lost. This technique is especially useful when quality scores are not available (FASTA reads). You could also use the sortbysize command with the -minsize 2 option to discard singletons.
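As an illustration of what discarding singletons means, the following Python sketch counts how many times each distinct sequence occurs and keeps only those seen at least twice. It is a simplified stand-in for abundance-based filtering, not the usearch code; the function name and the toy reads are hypothetical.

```python
from collections import Counter


def discard_singletons(reads, minsize=2):
    """Return {sequence: abundance} for sequences seen at least `minsize` times.

    Illustrative only: exact string identity is used to define "the same
    sequence", and results are ordered by decreasing abundance.
    """
    counts = Counter(reads)
    return {seq: n for seq, n in counts.most_common() if n >= minsize}


# Toy data: "CCCC" occurs once, so it is a singleton and is dropped.
reads = ["AAAA", "AAAA", "CCCC", "GGGG", "GGGG", "GGGG"]
print(discard_singletons(reads))  # {'GGGG': 3, 'AAAA': 2}
```

The dropped reads are excluded only from clustering; they can still be counted later when reads are mapped to OTUs.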

Sample identifiers should be added to the read labels
See Sample identifiers in read labels for discussion.

Length trimming
If you have variable-length reads, such as 454 reads, then you should trim them to a fixed length. See global trimming for discussion.

Length trimming can be done with the fastx_truncate command or by using the -fastq_trunclen option of fastq_filter. The latter enables length trimming and quality filtering to be done in a single step.
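The combination of length trimming and quality filtering can be sketched in Python as follows. This sketch assumes that reads shorter than the target length are discarded, that truncation is applied before the expected error test, and that qualities are Phred+33 encoded. Those are assumptions for illustration; check the fastq_filter documentation for the exact behavior. The function name is hypothetical.

```python
def truncate_and_filter(reads, trunclen, maxee=1.0, offset=33):
    """Truncate each (seq, qual) pair to `trunclen`, then filter by expected errors.

    Assumptions (not taken from the usearch source): reads shorter than
    `trunclen` are dropped, and the expected-error test is applied to the
    truncated read. Quality encoding is assumed to be Phred+33.
    """
    kept = []
    for seq, qual in reads:
        if len(seq) < trunclen:
            continue  # too short to bring to the fixed length
        seq, qual = seq[:trunclen], qual[:trunclen]
        ee = sum(10 ** (-(ord(c) - offset) / 10) for c in qual)
        if ee <= maxee:
            kept.append((seq, qual))
    return kept


reads = [
    ("ACGTACGT", "IIIIIIII"),      # good throughout: kept after truncation
    ("ACGT", "IIII"),              # shorter than 6: discarded
    ("ACGTACGT", "II!!IIII"),      # two Q0 bases inside the first 6: discarded
    ("ACGTACGTAC", "IIIIII!!!!"),  # bad tail is cut off by truncation: kept
]
print(truncate_and_filter(reads, trunclen=6))
# [('ACGTAC', 'IIIIII'), ('ACGTAC', 'IIIIII')]
```

The last read shows why the order matters: filtering the full-length read would reject it, while truncating first removes the low-quality tail.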

If you have overlapping paired reads, then you probably don't need to trim the length.

If you have Illumina unpaired reads, then they should be fixed-length and you probably don't need to trim. The only reason you might trim is if the read quality is too low towards the end of the read, in which case it might be better to truncate the reads to a shorter length. You can use the fastq_eestats2 command to check the quality. Alternatively, you might need to trim to a fixed length if the length varies due to preprocessing by some other tool.
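One way to choose a truncation length is to count how many reads would survive an expected error threshold at each candidate length. The Python sketch below does this for a list of quality strings. It is a rough illustration of that kind of check, not a reimplementation of fastq_eestats2; the function name, the toy quality strings and the Phred+33 encoding are assumptions.

```python
def count_passing(quals, lengths, maxee=1.0, offset=33):
    """For each candidate length, count reads that are long enough and have
    at most `maxee` expected errors after truncation to that length.

    Assumes Phred+33 quality strings (an assumption for this sketch).
    """
    table = {}
    for length in lengths:
        passing = 0
        for qual in quals:
            if len(qual) < length:
                continue  # read cannot be truncated to this length
            ee = sum(10 ** (-(ord(c) - offset) / 10) for c in qual[:length])
            if ee <= maxee:
                passing += 1
        table[length] = passing
    return table


# Toy quality strings: 'I' is Q40 (good), '!' is Q0 (worst possible).
quals = ["IIIIIIII!!", "IIIIIIIIII", "IIII!!!!!!"]
print(count_passing(quals, lengths=[4, 8, 10]))  # {4: 3, 8: 2, 10: 1}
```

A table like this makes the trade-off visible: shorter truncation lengths retain more reads but keep less sequence per read.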