USEARCH v11

Understand your reads

See also
  OTU / denoising pipeline
  Read preparation

You should understand how your libraries were prepared and sequenced, and if your reads are "raw" or have undergone any processing before you got them.

If possible, you should get the raw FASTQ files before any processing, especially quality filtering or length trimming, because trimming may be harmful and the usearch expected error method is much better than most other quality filters. Third-party quality filters that give especially bad results include the PANDAseq assembler and the QIIME filter described by Bokulich et al. 2013 (see results in Edgar & Flyvbjerg 2014).

If you downloaded Illumina reads from the NCBI Sequence Read Archive (SRA), see the fastq_sra_splitpairs command. In my experience, reads in the SRA and other public repositories have often undergone undocumented and/or undesirable processing, so I suggest asking the authors for details and whether they can share the raw FASTQs.

The fastx_info command is very useful for getting a quick summary of the length distribution and quality of the reads in a FASTQ file. Check all of the FASTQ files in case some of them differ from the rest. Here is a bash loop to do that:

mkdir -p ../fastq_info

for fq in *.fastq
do
  usearch -fastx_info "$fq" -output "../fastq_info/$fq"
done

Use grep to get a summary of one thing at a time (expected error (EE) distribution, length distribution, number of reads...) and check that all of the files are similar. Files that stand out may need special treatment. E.g.,

cd ../fastq_info
grep "^EE" *
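As an independent cross-check on the fastx_info reports, read counts and length ranges can also be tallied directly from the FASTQ files. The following is a minimal sketch, assuming uncompressed FASTQ files with exactly four lines per record (no wrapped sequence lines). The name fastq_summary is a hypothetical helper, not a usearch command.

```shell
# Hypothetical helper: print file name, read count, min and max read length
# as tab-separated columns. Assumes four-line FASTQ records, so the sequence
# is always line 2 of each record.
fastq_summary() {
  awk 'FNR % 4 == 2 {
         len = length($0)
         if (n == 0 || len < min) min = len
         if (len > max) max = len
         n++
       }
       END { printf "%s\t%d\t%d\t%d\n", FILENAME, n, min, max }' "$1"
}

# Summarize every FASTQ file in the current directory.
for fq in *.fastq
do
  [ -e "$fq" ] || continue   # nothing to do if no files match the pattern
  fastq_summary "$fq"
done
```

A file whose count or length range differs sharply from the others is a candidate for closer inspection.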

If you have paired reads, review the R1 and R2 files separately because they are usually different, e.g. the R2s tend to have lower quality.

Use the fastq_eestats2 command to review whether there are low-quality tails that should be truncated. If you have reads that vary in length, e.g. from 454 pyrosequencing, fastq_eestats2 will help you to decide which length to choose for trimming.
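A sketch of how fastq_eestats2 might be run over several files is given here. The wrapper name eestats_all and the output file naming are hypothetical, and the cutoffs 200,300,10 (low, high, step) are placeholder values that should be chosen to match your read lengths. Check the option names against the fastq_eestats2 command page before relying on them.

```shell
# Hypothetical wrapper: run fastq_eestats2 on each FASTQ file given and
# write one report per file into the output directory.
# Usage: eestats_all OUTDIR FILE.fastq [FILE.fastq ...]
eestats_all() {
  outdir=$1
  shift
  mkdir -p "$outdir"
  for fq in "$@"
  do
    # -length_cutoffs 200,300,10 is a placeholder range, not a recommendation.
    usearch -fastq_eestats2 "$fq" \
      -output "$outdir/$(basename "$fq" .fastq)_eestats2.txt" \
      -length_cutoffs 200,300,10
  done
}

# Example call (commented out so nothing runs until you choose the files):
# eestats_all ../eestats *.fastq
```

Each report shows how many reads would pass at each candidate truncation length, which makes it easier to see where a low-quality tail starts to cost reads.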

Get the primer sequences (the segments of the PCR oligos that bind to your gene, e.g. 16S primers), and confirm that at least the forward primer is present in the reads. If you have paired reads, then the R1s usually start with the forward primer and the R2s usually start with the reverse primer. You can do this using the search_oligodb command, substituting your own query file for otus.fa, e.g.

usearch -search_oligodb otus.fa -db primers.fa -strand both \
  -userout primer_hits.txt -userfields query+qlo+qhi+qstrand

If you don't find the primers then you should make sure you understand why. Maybe the library was prepared using an unusual strategy such as nested PCR, which may need some special processing in the pipeline, or maybe you have the wrong primer sequences.
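To see at a glance where the primers were found, the primer_hits.txt table written by the search_oligodb command can be tallied by start position and strand. This is a minimal sketch, assuming the file is tab-separated with columns in the order given by -userfields query+qlo+qhi+qstrand. The name primer_hit_summary is a hypothetical helper.

```shell
# Hypothetical helper: count hits per (qlo, qstrand) combination.
# Input columns: query, qlo, qhi, qstrand (tab-separated).
# Output columns: count, qlo, qstrand; most frequent combination first.
primer_hit_summary() {
  tab=$(printf '\t')
  awk -F'\t' '{ count[$2 "\t" $4]++ }
       END { for (k in count) printf "%d\t%s\n", count[k], k }' "$1" |
    sort -t "$tab" -k1,1nr -k2,2n
}

# Example call (commented out until primer_hits.txt exists):
# primer_hit_summary primer_hits.txt
```

When reads are the query and they begin with the primer, most hits would be expected at the first position on the plus strand. Hits scattered across many positions, or mostly on the minus strand, suggest that the library layout needs a closer look.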

If the FASTQ files are large, you can use the fastx_subsample command to get a smaller subset of reads for quick testing.
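A sketch of a subsampling call is given below. The wrapper name subsample_reads, the default of 1000 reads and the _sub.fastq output naming are assumptions made for illustration. Confirm the options against the fastx_subsample command page.

```shell
# Hypothetical wrapper: take a random subset of reads from one FASTQ file.
# Usage: subsample_reads FILE.fastq [NUMBER_OF_READS]
# Writes FILE_sub.fastq next to the input; the default size is a placeholder.
subsample_reads() {
  usearch -fastx_subsample "$1" \
    -sample_size "${2:-1000}" \
    -fastqout "${1%.fastq}_sub.fastq"
}

# Example call (commented out until you pick a file):
# subsample_reads Sample1_R1.fastq 5000
```

The subset can then be used to try out later steps quickly before running them on the full files.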
 
If you have paired reads, pick a couple of samples and try running the fastq_mergepairs command. Verify that most of the pairs align and that the lengths of the consensus sequences are consistent with the distribution expected for your primer pair. See links from the fastq_mergepairs command page for more documentation about adjusting parameters, verifying the results and troubleshooting problems.
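The merge-and-check step can be sketched as follows. The helper names length_distribution and merge_and_check are hypothetical, and the length tally assumes the merged output is a FASTQ file with exactly four lines per record.

```shell
# Hypothetical helper: print "length<TAB>count" for the sequences in a
# four-line FASTQ file, sorted by increasing length.
length_distribution() {
  awk 'FNR % 4 == 2 { n[length($0)]++ }
       END { for (len in n) printf "%d\t%d\n", len, n[len] }' "$1" |
    sort -k1,1n
}

# Hypothetical wrapper: merge one sample, then show the merged lengths.
# Usage: merge_and_check R1.fastq R2.fastq MERGED.fastq
merge_and_check() {
  usearch -fastq_mergepairs "$1" -reverse "$2" -fastqout "$3" &&
    length_distribution "$3"
}

# Example call (file names are placeholders):
# merge_and_check Sample1_R1.fastq Sample1_R2.fastq Sample1_merged.fastq
```

Compare the resulting distribution with the amplicon length expected for your primer pair. A peak at an unexpected length, or a very broad spread, is a reason to revisit the merge parameters.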