Given a reference database D of the sequences in a sample, where D is
assumed to be complete and correct, UPARSE-REF infers errors
in a sequence using parsimony. The goal of UPARSE-REF is to explain a given
sequence S with the fewest possible events starting from sequences in D. Here, "events" are
mutations that arise from PCR or sequencing errors. This is done by constructing
a model sequence M using one or more sequences from the database (refseqs).
Typically, M is a single refseq representing a non-chimeric amplicon. Otherwise,
M is made from m refseq segments that are concatenated to represent a chimeric
amplicon. If M has one segment, i.e. is a single refseq, then the distance
between M and S is defined to be the number of mismatches, which are interpreted
as sequencer or PCR errors.
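The single-segment case can be illustrated with a minimal Python sketch. It is not the UPARSE-REF implementation: it assumes the sequences are already aligned without gaps and have equal length, and the names `mismatches` and `best_single_refseq` are hypothetical helpers introduced only for this example.

```python
def mismatches(model: str, seq: str) -> int:
    """Count differing positions between two pre-aligned, equal-length sequences."""
    if len(model) != len(seq):
        # Assumption of this sketch: no gaps, so lengths must agree.
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(model, seq) if a != b)


def best_single_refseq(seq: str, refseqs: dict) -> tuple:
    """Return (name, distance) for the refseq with the fewest mismatches to seq."""
    scored = ((name, mismatches(ref, seq)) for name, ref in refseqs.items())
    return min(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy database D with two refseqs (made-up sequences).
    D = {"A": "ACGTACGT", "B": "ACGTTTTT"}
    S = "ACGTACGA"
    print(best_single_refseq(S, D))
```

With this toy input, refseq A differs from S at one position and refseq B at four, so A is the more parsimonious single-segment model.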
The figure below shows an example where
the read has a chimeric model. Here, the penalty for a chimeric crossover is +3
and the penalty for a mismatch is +1. The total score for the model is 4: +1 for
one mismatch plus +3 for one chimeric crossover.
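The arithmetic of this scoring can be sketched in Python. The penalties (+1 per mismatch, +3 per crossover) are taken from the example above; everything else is an assumption made for illustration: the function name `model_score` is hypothetical, the segments are assumed to be already chosen and to concatenate into a model of the same length as the read, and no gaps are handled. The sketch scores one given model; it does not search for the best model.

```python
MISMATCH_PENALTY = 1   # penalty per mismatch, as in the example above
CROSSOVER_PENALTY = 3  # penalty per chimeric crossover, as in the example above


def model_score(segments: list, seq: str) -> int:
    """Score a model built by concatenating refseq segments against a sequence.

    A model with m segments has m - 1 chimeric crossovers.
    """
    if not segments:
        raise ValueError("a model needs at least one segment")
    model = "".join(segments)
    if len(model) != len(seq):
        # Assumption of this sketch: gap-free alignment of equal length.
        raise ValueError("model and sequence must have the same length")
    n_mismatch = sum(1 for a, b in zip(model, seq) if a != b)
    n_crossover = len(segments) - 1
    return MISMATCH_PENALTY * n_mismatch + CROSSOVER_PENALTY * n_crossover


if __name__ == "__main__":
    # Two made-up segments joined at one crossover; the read differs at one base.
    print(model_score(["ACGTA", "GGCCT"], "ACGTAGGCAT"))
```

Here the two-segment model has one crossover and one mismatch, giving a score of 4, while a single-segment model that matches the read exactly would score 0.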
UPARSE-REF is used internally as a step in the UPARSE-OTU algorithm for OTU
construction (cluster_otus command). The main use for UPARSE-REF as a standalone
command (uparse_ref)
is annotation of reads, OTUs and other sequences in mock community experiments
where the set of biological sequences in the sample is known.
UPARSE-REF does not perform well on the
benchmarks developed to validate ChimeraSlayer and UCHIME.
It is not intended as
a general-purpose chimera detection or chimera filtering algorithm.