| See also OTU clustering 
Preclustering is a method suggested by Huse et al. (2010). The key observation is that reads with errors (or 
amplicons with errors) should be less abundant than correct reads (or correct 
amplicons). This is because sequencer errors are unlikely to be reproduced by 
chance, and amplicons with PCR errors will have undergone fewer rounds of 
amplification. 
This suggests the following technique: if a read (R) has 
only a small number of differences with a read of higher abundance (H), then 
assume H is the correct read corresponding to R, so add the abundance of R to H 
and discard R. 
This merging is performed by 
cluster_otus, which simultaneously performs chimera filtering and greedy OTU 
construction by considering reads in order of decreasing abundance. 
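The merging rule described above can be illustrated with a minimal Python sketch. It is not the cluster_otus implementation: it assumes reads of equal length that differ only by substitutions, does no chimera filtering, and uses hypothetical names (count_diffs, precluster) and made-up reads.

```python
from typing import Dict, List, Tuple


def count_diffs(a: str, b: str) -> int:
    """Number of mismatched positions; assumes equal-length reads (no indels)."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares reads of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def precluster(abundances: Dict[str, int], max_diffs: int = 1) -> List[Tuple[str, int]]:
    """Greedily merge each read into the first kept read within max_diffs."""
    # Process reads from most to least abundant, so every kept read (H)
    # is at least as abundant as any read (R) later merged into it.
    # Ties are broken alphabetically, which is an arbitrary choice.
    ordered = sorted(abundances.items(), key=lambda kv: (-kv[1], kv[0]))
    kept: List[List] = []  # each entry is [sequence, total abundance]
    for seq, size in ordered:
        for entry in kept:
            if len(entry[0]) == len(seq) and count_diffs(entry[0], seq) <= max_diffs:
                entry[1] += size  # add the abundance of R to H, discard R
                break
        else:
            kept.append([seq, size])  # no close match: keep as a new read
    return [(seq, size) for seq, size in kept]


# Made-up dereplicated reads with their abundances.
reads = {"ACGTACGT": 10, "ACGTACGA": 2, "TTTTCCCC": 5, "ACGAACGA": 1}
result = precluster(reads, max_diffs=1)
print(result)
```

In this toy input, ACGTACGA is one difference from the most abundant read and is merged into it. ACGAACGA is two differences from that read, so it is kept separately, even though it is one difference from the discarded ACGTACGA: a greedy pass compares only against reads that were kept.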
Preclustering can increase sensitivity 
If you follow my recommendation 
to discard singleton reads, sensitivity may be reduced because "lone 
singletons" are lost, i.e. cases where the highest-abundance read for a given 
species is a singleton. Some of these species can be preserved by a 
preclustering step that merges very similar reads. If there is one correct read for a 
species, there may be one other read of the same gene that has one error. In 
this case, the species has two lone singleton reads, and these will be lost when 
singletons are discarded. They can be preserved by merging reads that have only 
a single difference, e.g. using this command: 
  usearch -cluster_smallmem derep.fa -id 0.99 -maxdiffs 1 -centroids preclustered.fa 
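The effect on lone singletons can be sketched in Python. This is a toy illustration under stated assumptions, not what usearch does internally: reads have equal length, differences are substitutions only, and the function names and sequences are hypothetical.

```python
def diffs(a: str, b: str) -> int:
    """Count mismatches between two equal-length reads (no indels in this sketch)."""
    return sum(x != y for x, y in zip(a, b))


def merge_near_identical(reads, max_diffs=1):
    """Merge each read into the first earlier-kept read within max_diffs."""
    kept = []  # list of [sequence, size], most abundant first
    # Ties in abundance are broken alphabetically, an arbitrary choice.
    for seq, size in sorted(reads, key=lambda r: (-r[1], r[0])):
        target = next(
            (k for k in kept if len(k[0]) == len(seq) and diffs(k[0], seq) <= max_diffs),
            None,
        )
        if target is None:
            kept.append([seq, size])
        else:
            target[1] += size
    return [(seq, size) for seq, size in kept]


def drop_singletons(reads):
    """Keep only reads or clusters with abundance of at least 2."""
    return [(seq, size) for seq, size in reads if size >= 2]


# Hypothetical data: species A has one abundant read; species B is seen only
# as a correct singleton plus a singleton with one error.
derep = [("AAAACCCCGGGG", 8), ("TTTTGGGGAAAA", 1), ("TTTTGGGGAAAT", 1)]

without_preclustering = drop_singletons(derep)
with_preclustering = drop_singletons(merge_near_identical(derep, max_diffs=1))
print(without_preclustering)
print(with_preclustering)
```

Without the merge, both reads of species B are discarded as singletons. With it, they form a cluster of size 2 that survives the filter. Which of the two singletons ends up as the representative depends on tie-breaking, so the representative is not guaranteed to be the correct read.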
Motivation for maxdiffs 1 
The choice of -id 0.99 is arbitrary; an identity threshold must be provided because it is required by cluster_smallmem. If two reads have only a single difference, then most likely one of them is correct (because it is very unlikely that two bad reads would agree on all errors except one). So the maximum of one difference is not arbitrary. If larger numbers of differences are allowed, then two bad reads may be merged, and the merged read is more likely to create a spurious OTU. |