Global trimming for ITS amplicon reads

ITS amplicons have large variations in length due to the biology of the region -- some of the sequence evolves neutrally, and long indels are common.

This is the strategy I currently recommend for global trimming of ITS reads.

1. Pick a fixed length which is as long as possible without losing a large fraction of the reads because they have expected errors > 1 (or your chosen e.e. threshold). The fastq_eestats command is useful for figuring out a good compromise. Call this length L_trim.

2. If a match to the reverse primer is present, then delete the matching letters and any letters after that.

3. If the read is shorter than a reasonable length given your primer pair, discard the read.

4. If the read is longer than L_trim, truncate to L_trim.

5. If the read is shorter than L_trim, pad with Ns so that it is L_trim letters.

Step 5 is needed because cluster_otus considers terminal gaps to be real differences. After this step, all your reads should now have length L_trim.
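To make the trade-off in step 1 concrete, here is a minimal Python sketch of the calculation behind it. It is not the fastq_eestats command, only an illustration. It uses the standard definition of expected errors (the sum of per-base error probabilities, 10^(-Q/10)) and assumes Phred+33 quality encoding. The function names and the `max_ee` parameter are hypothetical. Reads shorter than the candidate length are scored over their whole length, on the assumption that they will be padded later rather than lost.

```python
def expected_errors(quals, offset=33):
    """Sum of per-base error probabilities for one FASTQ quality string.

    Assumes Phred+33 encoding unless a different offset is given.
    """
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quals)


def fraction_passing(qual_strings, length, max_ee=1.0):
    """Fraction of reads whose first `length` bases have e.e. <= max_ee.

    Reads shorter than `length` are scored over their full length.
    """
    if not qual_strings:
        return 0.0
    passing = sum(
        1 for q in qual_strings if expected_errors(q[:length]) <= max_ee
    )
    return passing / len(qual_strings)
```

Evaluating `fraction_passing` over a range of candidate lengths shows where the fraction of retained reads starts to drop sharply, which is one way to pick L_trim.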

Steps 2 to 5 should be done before quality filtering by max e.e. You will need to write your own script for this, because usearch currently doesn't have commands with the necessary features. You can use the search_oligodb command to find the reverse primer matches.
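As a starting point for such a script, steps 2 to 5 could be sketched in Python roughly as follows. This is an illustration under several assumptions, not a tested pipeline: the primer is located by a simple mismatch-counting scan rather than search_oligodb, degenerate (IUPAC) primer letters are not handled, and the primer must be supplied in the orientation in which it appears in the read. The names `find_primer`, `global_trim`, `min_len`, `max_mismatches` and `pad_qual` are hypothetical. The source does not say what quality score the padding Ns should get; here they receive a high score (`I`, Q40 in Phred+33) so that padding adds almost nothing to the expected errors.

```python
def find_primer(seq, primer, max_mismatches=2):
    """Return the start of the first window matching `primer`, or -1.

    A window matches if it differs from the primer at no more than
    `max_mismatches` positions. Indels and IUPAC codes are not handled.
    """
    n = len(primer)
    for start in range(len(seq) - n + 1):
        window = seq[start:start + n]
        mismatches = sum(1 for a, b in zip(window, primer) if a != b)
        if mismatches <= max_mismatches:
            return start
    return -1


def global_trim(seq, qual, primer, l_trim, min_len,
                max_mismatches=2, pad_qual="I"):
    """Apply steps 2 to 5 to one read.

    Returns (seq, qual), both exactly l_trim long, or None if the
    read is discarded in step 3.
    """
    # Step 2: delete the primer match and everything after it.
    pos = find_primer(seq, primer, max_mismatches)
    if pos >= 0:
        seq, qual = seq[:pos], qual[:pos]
    # Step 3: discard reads that are implausibly short.
    if len(seq) < min_len:
        return None
    # Step 4: truncate long reads to l_trim.
    seq, qual = seq[:l_trim], qual[:l_trim]
    # Step 5: pad short reads with Ns (pad_qual is an assumption).
    pad = l_trim - len(seq)
    return seq + "N" * pad, qual + pad_qual * pad
```

For example, `global_trim("ACGTACGTAAGGTTCC", "I" * 16, "GGTT", l_trim=12, min_len=5, max_mismatches=0)` removes the primer match at position 10 and pads the remaining 10 letters to 12.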

Once you've pre-processed the reads to get them to the fixed length, proceed as usual to make UPARSE OTUs: quality filter, dereplicate, discard singletons, and run cluster_otus.

Finally, you'll need to strip the trailing Ns (added in step 5) from the OTU sequences.
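Stripping the padding is a small post-processing step. A minimal Python sketch, assuming the OTUs are in FASTA format (possibly with sequences wrapped over several lines), might look like this. The function names are hypothetical. Only Ns at the very end of each sequence are removed, so any internal N is left untouched.

```python
def read_fasta(lines):
    """Yield (label, sequence) pairs from an iterable of FASTA lines."""
    label, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def strip_otus(lines):
    """Return (label, sequence) pairs with trailing Ns removed."""
    return [(label, seq.rstrip("Nn")) for label, seq in read_fasta(lines)]
```

To process a file, pass an open file handle, for example `strip_otus(open("otus.fa"))`, where `otus.fa` is a placeholder file name.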