
See also
UPARSE home page
UPARSE pipeline home page
OTU benchmark results
UPARSE algorithm
Tutorial examples
Overview
Reads must be processed before clustering and
mapping to OTUs. Exactly which steps are needed depends on which library
preparation, PCR and sequencing methods were used.
Paired read assembly
If you have overlapping paired
reads, then the first step is to assemble them using
fastq_mergepairs.
Non-biological sequence should be removed
It's important
to understand the structure of your reads,
e.g. where barcodes, primers and adapters might appear. If some or all of
the reads contain non-biological sequence then those bases should be
stripped before clustering. Segments which match primers should also be
stripped because PCR tends to substitute mismatched positions with
complementary bases. You can strip primer-binding segments by using the
-stripleft and -stripright options of
fastx_truncate.
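To make the effect of stripping concrete, here is a minimal Python sketch that removes a fixed number of bases from each end of a read. It only illustrates the idea behind -stripleft and -stripright; it is not how fastx_truncate is implemented. The function name strip_ends and the lengths used in the usage comment are hypothetical.

```python
def strip_ends(seq: str, strip_left: int = 0, strip_right: int = 0) -> str:
    """Remove strip_left bases from the start and strip_right bases from the end.

    Illustrative only: the lengths would normally be the lengths of the
    primer-binding segments (hypothetical values in the example below).
    """
    if strip_left < 0 or strip_right < 0:
        raise ValueError("strip lengths must be non-negative")
    end = len(seq) - strip_right
    # If the two stripped segments cover the whole read, nothing is left.
    return seq[strip_left:end] if end > strip_left else ""


# Example: a 3-base segment at each end of a 12-base read (made-up sequence).
trimmed = strip_ends("AAACGTACGTTT", strip_left=3, strip_right=3)
```

The same slice would be applied to the quality string of a FASTQ read so that bases and quality scores stay aligned.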
Quality filtering
Generating high-quality OTUs requires that a high-quality subset of the reads is
selected before running the cluster_otus
command. Otherwise, many spurious OTUs may be generated by reads with too
much sequencing error.
The high-quality subset may be a relatively small fraction of the reads, but
the discarded reads are not lost, because most of them will map to OTUs when the
OTU table is constructed.
Quality filtering of FASTQ reads is done by the fastq_filter command. Use the -fastq_maxee option to set an expected error threshold; 1.0 is recommended as a default. Don't worry if this filter appears to be stringent; you will probably find that most of the discarded reads map to OTUs after clustering.
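As a rough illustration of what an expected error threshold measures, the sketch below computes the expected number of errors in a read as the sum of the per-base error probabilities implied by its Phred scores, then compares it to a threshold. It assumes Phred+33 quality encoding; the function names are hypothetical and this is not the fastq_filter implementation.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities, P = 10^(-Q/10).

    Assumes the quality string uses the given ASCII offset (33 here).
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def passes_maxee(quality: str, max_ee: float = 1.0, offset: int = 33) -> bool:
    """True if the read has no more than max_ee expected errors."""
    return expected_errors(quality, offset) <= max_ee


# 'I' encodes Q40 (P = 0.0001) and '+' encodes Q10 (P = 0.1) with offset 33.
good_read = "I" * 100   # 100 bases at Q40
poor_read = "+" * 20    # 20 bases at Q10
```

With these made-up quality strings, the 100-base Q40 read has about 0.01 expected errors and passes a threshold of 1.0, while the 20-base Q10 read has about 2 expected errors and is discarded.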
Discarding singletons
Discarding singletons is an effective error filter because most
singleton reads have at least one base call error. It is therefore
recommended to use the -minsize 2 option of cluster_otus, which ignores
singletons. As with low-quality reads, most singletons will
map to an OTU when the OTU table is
constructed, so the data is not lost. This technique is especially useful
when quality scores are not available (FASTA reads).
You could also use the sortbysize command
with the -minsize 2 option to discard singletons.
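The idea behind a minimum size of 2 can be sketched in a few lines of Python: count how many times each unique sequence occurs and keep only those seen at least twice. This is an illustration of the concept under simple assumptions (exact string matching on a small in-memory list), not the behavior of cluster_otus or sortbysize in detail; abundant_uniques is a hypothetical name.

```python
from collections import Counter


def abundant_uniques(reads, min_size=2):
    """Return (sequence, abundance) pairs with abundance >= min_size,
    ordered by decreasing abundance."""
    counts = Counter(reads)
    return [(seq, n) for seq, n in counts.most_common() if n >= min_size]


# Made-up reads: "ACGA" occurs once, so it is a singleton and is dropped.
reads = ["ACGT", "ACGT", "ACGA", "TTGA", "TTGA", "TTGA"]
kept = abundant_uniques(reads)
```

The singleton is excluded only from the set used to build OTUs; the original reads can still be mapped to OTUs afterwards.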
Sample identifiers should be added to the read labels
See Sample identifiers in read labels
for discussion.
Length trimming
If you have variable-length reads, such
as 454 reads, then you should trim them to a fixed length. See
global trimming for discussion.
Length trimming can be done with the fastx_truncate command or by using the -fastq_trunclen option of fastq_filter. This enables length trimming and quality filtering to be done in a single step.
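The sketch below shows one way length trimming and quality filtering can be combined in a single pass. It makes two assumptions that should be checked against the fastq_filter documentation: reads shorter than the truncation length are discarded, and the expected error threshold is applied to the truncated read. The function name and the example records are hypothetical.

```python
def truncate_then_filter(records, trunc_len, max_ee=1.0, offset=33):
    """Truncate (sequence, quality) pairs to trunc_len, then apply an
    expected error filter to the truncated reads.

    Assumptions (not confirmed here): shorter reads are discarded, and
    quality strings use Phred+33 encoding.
    """
    kept = []
    for seq, qual in records:
        if len(seq) < trunc_len:
            continue  # too short to give a fixed-length read
        seq, qual = seq[:trunc_len], qual[:trunc_len]
        ee = sum(10 ** (-(ord(ch) - offset) / 10) for ch in qual)
        if ee <= max_ee:
            kept.append((seq, qual))
    return kept


# Made-up records: 'I' is Q40 and '+' is Q10 with offset 33.
records = [
    ("ACGTACGT", "IIII++++"),  # good start, poor tail
    ("ACG", "III"),            # shorter than the truncation length
    ("ACGTAC", "++++++"),      # uniformly poor quality
]
fixed_length = truncate_then_filter(records, trunc_len=4)
```

In this example the first read keeps only its high-quality first four bases, the second is dropped for being too short, and the third is kept at the default threshold because four Q10 bases give about 0.4 expected errors.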
If you have overlapping paired reads, then you probably don't need to trim the length.
If you have Illumina unpaired reads, then they should be fixed-length and
you probably don't need to trim. One reason to trim anyway is poor read
quality towards the end of the reads, in which case it might
be better to truncate them to a shorter length. You can use the
fastq_eestats2 command to check the
quality. You might also need to trim to a fixed length if the length varies due to
preprocessing by some other tool.