Home Software Services About Contact usearch manual

Illumina unpaired reads of variable-length amplicons

See also
 
UPARSE home page
  UPARSE pipeline home page

If you have unpaired reads of amplicons which vary significantly in length, then you may have to deal with an issue which is not usually considered when constructing a UPARSE pipeline.

Are some amplicons shorter than the read length?
The critical question is whether some amplicons are shorter than the read length. If the answer is no, then there is nothing special to do; the standard UPARSE approach will work fine and you don't need to read the rest of this page.

If some amplicons are shorter than the read length, then reads of those amplicons will continue into non-biological bases, creating "overhangs" containing sequencing primer, adapter and eventually random junk. With paired reads, this isn't usually a problem because paired read merging detects and discards overhangs when the forward read extends past the beginning of the reverse read or vice versa ("staggered" pairs). With unpaired reads, you have to find and delete the overhangs by some other method.

How to check if you have reads
If you have reads, then the easiest way is to search for the reverse primer(s) using search_oligodb. If you don't find it, then you don't have the problem. There are a few subtleties here to be careful of. The reverse biological primer may or may not appear in the read, depending on which PCR protocol(s) were used to make the sequencing constructs. The reverse sequencing primer must be present in the construct, and the reverse adapter must also appear in the construct. So to be safe, it's probably a good idea to search for those if you're not 100% sure whether the biological primer is in the sequencing construct. (Strictly speaking, when I say biological primer here I mean the primer-binding bases in the biological sequence)..

How to check if you don't have reads
If you don't have reads, e.g. because you're still in the process of designing the experiment, or you have reads and want to double-check that the amplicons are long enough (never a bad idea!) then you can use search_pcr on a reference database of known full-length sequences for your gene / region. This will generate a set of predicted amplicon sequences which terminate at your primers. Use sortbylength to sort the amplicons by decreasing length and check the length of the last sequences in the file. There may be a few spurious artifacts that are very short, so be sure to check manually to be sure everything looks reasonable. If the shortest predicted amplicon is longer than your read length, then you are probably ok and don't need to worry about this issue.

How to find and delete overhangs
The simplest method for deleting overhangs is to trim all the reads to a sufficiently short length using fastx_truncate or fastq_filter. Suppose for example you have 300nt reads and you find that 190nt would be short enough to ensure that there are no overhangs. In this case, I would probably recommend trimming all reads to 190nt. This may be a hard sell to your boss (why are you throwing away all those bases!?), but it has some advantages. It's simple, reliable and solves the problems that occur when you have varying read lengths (see global trimming for discussion). Also, at the time of writing, 300nt reads tend to have poor quality towards the end and truncating them will retain more reads after quality filtering and improve sensitivity to low-abundance species.

If truncating reads of short amplicons is not an option (your boss, or it discards too many bases), then you need to find the overhangs by sequence matching. The best way to do this depends on the details of your sequencing construct and the gene / region you are sequencing. If the biological primer-matching bases are in the reads, then searching for the (reverse-complemented) primer is probably the most reliable method. You can do this using search_oligodb. You'll need to capture the position of the primer in the read, e.g. using -userout hits.txt ‑userfields query+target+qlo+qhi+qstrand -output_no_hits then write your own script to read hits.txt and the FASTQ file for the reads to do the truncating. If you have a 64-bit license and you don't have the scripting skills to do this, let me know and I'll take a look at adding a new command for you. It may be possible within a few days.