fastq_sra_splitpairs command

USEARCH v11

The NCBI Sequence Read Archive (SRA) stores paired reads in at least two different formats: interleaved and concatenated. To the best of my knowledge, these formats are not documented, so you have to work out which one you have by inspection. The fastx_info command is useful for a quick check. For example, if the median sequence length is 600, then you probably have 2x300 reads in concatenated format.
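The median-length check can also be sketched in a few lines of Python when fastx_info is not to hand. This is only an illustration, not a substitute for fastx_info: the function name median_read_length is hypothetical, and the code assumes plain four-line FASTQ records (label, sequence, "+", quality) with no line-wrapped sequences.

```python
import statistics


def median_read_length(fastq_lines):
    """Return the median sequence length over four-line FASTQ records.

    Assumes each record is exactly four lines: label, sequence, '+', quality.
    Raises statistics.StatisticsError if there are no complete records.
    """
    lines = [ln.rstrip("\n") for ln in fastq_lines]
    # The sequence is the second line of every four-line record.
    lengths = [len(lines[i + 1]) for i in range(0, len(lines) - 3, 4)]
    return statistics.median(lengths)
```

A median close to twice the expected read length (e.g. 600 for a 2x300 run) suggests concatenated format; a median close to the read length itself suggests interleaved or single-end reads.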

The -mode option specifies the format. Valid values are interleaved and concatenated.

FASTQ output files for the R1 (forward) and R2 (reverse) reads are specified by the -output1 and -output2 options.

Interleaved format
With interleaved format, FASTQ records are R1, R2, R1, R2 etc. This is supported by the -interleaved option of fastq_mergepairs, so if you want to merge the pairs you may not need to run fastq_sra_splitpairs first.
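To make the record order concrete, here is a minimal Python sketch of de-interleaving. It is not how usearch is implemented; the function name split_interleaved and the representation of a record as a (label, sequence, quality) tuple are assumptions made for illustration.

```python
def split_interleaved(records):
    """Split records ordered R1, R2, R1, R2, ... into (r1_list, r2_list).

    Each record is assumed to be a (label, sequence, quality) tuple.
    """
    if len(records) % 2 != 0:
        # An odd count means at least one pair is incomplete.
        raise ValueError("odd number of records; input is not cleanly interleaved")
    # Even positions hold the forward reads, odd positions the reverse reads.
    return records[0::2], records[1::2]
```

The two returned lists stay in pair order, so the i-th forward read and the i-th reverse read belong to the same pair.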

Concatenated format
With concatenated format, the R1 and R2 sequences are combined into a single sequence with R2 immediately following R1 (as opposed to merged or assembled). There is no spacer or filler sequence separating the reads; they are simply concatenated. Sometimes the reads are truncated by quality filtering before they are concatenated, in which case they are pretty much useless because it is impossible to recover the original R1s and R2s. If some or all of the reads are full-length, then the R1s and R2s can be recovered by splitting the sequence (and quality scores) at the half-way point.

With concatenated format, the read length must be specified by the -readlength option, e.g. 250. If the concatenated sequence is exactly 2x the read length (e.g. 500), it is split at the midpoint; otherwise it is discarded because it is impossible to determine where the R1 ends and the R2 begins.
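The split-or-discard rule for concatenated reads can be sketched as follows. This is a minimal illustration of the rule as described here, not the usearch source: the function name split_concatenated is hypothetical, both output reads reuse the input label unchanged, and R2 is left exactly as stored (no reverse-complementing), since the text does not say how usearch handles either point.

```python
def split_concatenated(label, seq, qual, read_length):
    """Split one concatenated record into (r1, r2), or return None to discard.

    The record is kept only if the sequence is exactly 2 * read_length long
    and the quality string has the same length as the sequence.
    """
    if len(seq) != 2 * read_length or len(qual) != len(seq):
        # Cannot tell where R1 ends and R2 begins, so discard.
        return None
    # Split the sequence and the quality scores at the same midpoint.
    r1 = (label, seq[:read_length], qual[:read_length])
    r2 = (label, seq[read_length:], qual[read_length:])
    return r1, r2
```

With read_length set to 250, a 500-base record yields two 250-base reads, while a 480-base record (for example, one truncated by quality filtering) returns None.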

Example

usearch -fastq_sra_splitpairs SRA457665.fastq -mode concatenated \
  -readlength 250 -output1 fwd.fq -output2 rev.fq