USEARCH manual

Misop tutorial
MiSeq 2x250 PE reads, mouse feces and mock community samples

Part 1: UPARSE pipeline (reads to OTU table)
Part 2: Mock community analysis (stringent)
Part 3: Mock community analysis (less stringent)
Part 4: Read quality and error rate analysis

Part 2. Mock community analysis
This analysis illustrates a very important point that is often misunderstood in the literature: it is NOT reasonable to validate an OTU clustering algorithm by testing whether the number of OTUs is the same as the number of species in the mock community. You MUST understand what the OTUs represent (expected species? contaminants? read errors? chimeras?) and how many clusters can reasonably be expected (are some species >97% similar? are some species absent from the reads?). With the parameters we use here in Part 1, the number of OTUs is close to the expected number (21 species), but a couple of them are misleading. In other studies, the discrepancies are larger; for example it is quite common with HMP "Uneven" mock communities for several species to be missing from the reads. Part 3 of this tutorial will show that there are at least 40 species in the mock data, so an algorithm cannot be criticized for making ~40 OTUs from these reads and the fact that we get 20 here in Part 2 is partly due to luck.

Here we look at the results in misop/mock which were generated by the run_mock.bash script. The otus.fa file contains the OTU representative sequences. Notice that there are 20 OTUs, but there are 21 species in the designed community. See misop/ref/HMP_MOCK.v35.fasta for the known 16S sequences in those species.

The uparse_ref command is used to compare OTUs with a reference database for the mock community; HMP_MOCK.v35.fasta in our case. The command line in run_mock1.bash is:

$usearch -uparse_ref $out/otus.fa -db ../ref/HMP_MOCK.v35.fasta -strand both \
-uparseout $out/otus.uparseref -log $out/uparse_ref.log

The output file is out/mock/otus.uparseref. Look at this file using a Linux command like less, more or cat, or alternatively open it in a text editor such as vi or emacs. See uparseout option for description of the file format. You'll see that most of the lines show a perfect 100% match to a reference sequence, like this:

Otu1 perfect 100.0 100.0 S.epidermidis.1
Otu2 perfect 100.0 100.0 A.baumannii.1
Otu3 perfect 100.0 100.0 B.cereus.1

However, there is one execption:

Otu20 other 94.1 94.1 P.aeruginosa.1

This is only 94% similar to the closest reference sequence. We can try to identify it by searching a larger database of 16S sequences, e.g. the gold UCHIME reference (right-click on the link to download). Make a file otu20.fa (e.g. by copying otus.fa and deleting all the other sequences), then run:

usearch -usearch_global otu20.fa -db gold.fa -strand plus -id 0.99 -alnout otu20.aln -notrunclabels

(The -notrunclabels option is needed because there are blanks in the species names in gold.fa).

If you look in otu20.aln you will see a 100% match to AJ492829_S000145154, which is Pseudomonas poae. So this is another perfect sequence, but not from the designed community. Normally I would suspect a contaminant in the sample before sequencing, but in this case it is because the mock and fecal samples were sequenced in the same run as some soil samples. Pseudomonas is one of the most abundant OTUs in the soil samples, so this is a contaminant which is probably explained by sample cross-talk in the sequencing, e.g. due to barcode read errors.

So we have 19 perfect OTUs from the expected community and one perfect contaminant sequence. This leaves two species in the designed community which do not have OTUs: P.acnes and S.aureus.

S.aureus drops out because it is 99.4% identical to S.epidermidis, so the S.aureus reads are assigned to the S.epidermidis OTU. To check this, make a file pair.fa containing sequences S.epidermidis.1 and S.aureus.1 from HMP_MOCK.v35.fasta and run:

usearch -cluster_fast pair.fa id 0.9 -alnout pair.aln

Look in pair.aln to see the alignment. You can find the aureus reads by making a file s_aureus.fa with just the S.aureus reference sequences and running this command:

usearch -usearch_global merged.fq -db s_aureus.fa -strand both -id 1.0 \
-userout s_aureus_reads.txt -userfields query

Note the use of -id 1.0 to select reads that are 100% identical in order to filter out S.epidermidis reads. You will find 513 hits, so S.aureus is present but does not induce a separate OTU.

That leaves P.acnes. We can perform the same exercise to see if P.acnes is present in the reads. Make a file p_acnes.fa with just the P.acnes reference sequence and run:

usearch -usearch_global ../fq/Mock*_R1_*.fastq -db p_acnes.fa -strand both -id 0.97 \
-userout p_acnes_fwd.txt -userfields query+id

You will find three hits in p_acnes_hits.txt, one with 100.0% identity and two with 99.6% identity. If you do the same with the reverse reads (_R2_), you will find no hits. We can ask if the P. acnes reads were merged by using the -tabbedout option of fastq_mergepairs:

usearch -fastq_mergepairs ../fq/Mock*_R1_*.fastq -fastqout merged.fq -tabbedout merge_tab.txt

Get the P. acnes read labels:

cut -f1 p_acnes_fwd.txt > p_acnes_labels.txt

Find the records for those reads in merge_tab.txt:

grep -Ff p_acnes_labels.txt merge_tab.txt

You should see three records like this:

... diffs=26 diffpct=10.5 diffpcttoohigh result=notmerged

So P. acnes fell out of the analysis because the R2s failed to merge due to having a large number of differences. They will be included if you set the -fastq_maxdiffs and -fastq_maxdiffpct to sufficiently high values, e.g. -fastq_maxdiffs 99 -fastq_maxdiffpct 99, though this would not usually be recommended because you will tend to get many false-positive merges. Another possible strategy is to make OTUs from the forward reads only.

Part 3. Mock community analysis with singletons and lower-quality reads included >