fastq_sra_splitpairs command

The NCBI Sequence Read Archive (SRA, formerly the Short Read Archive) stores paired reads in at least two different formats: interleaved and concatenated. These formats are not documented, to the best of my knowledge, so you have to figure out which format you have by inspection. The fastx_info command is useful for a quick check. E.g., if the median sequence length is 600, then you probably have 2x300 reads in concatenated format.
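If fastx_info is not to hand, the same kind of check can be sketched in a few lines of Python. This is only an illustration, not part of usearch: the function names are hypothetical, and it assumes strict four-line FASTQ records with no line wrapping.

```python
import statistics

def fastq_lengths(lines):
    """Yield the sequence length of each record.

    Assumption: every record is exactly four lines (header, sequence,
    '+', quality), so the sequence is the second line of each group.
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield len(line.rstrip("\n"))

def median_length(lines):
    """Median sequence length over all records."""
    return statistics.median(fastq_lengths(lines))

# Tiny made-up input: three records with sequence lengths 6, 4 and 6.
example = [
    "@r1\n", "ACGTAC\n", "+\n", "IIIIII\n",
    "@r2\n", "ACGT\n", "+\n", "IIII\n",
    "@r3\n", "ACGTAC\n", "+\n", "IIIIII\n",
]
print(median_length(example))
```

For a real file, pass the open file handle instead of the list, e.g. median_length(open("reads.fastq")). A median close to twice the expected read length suggests concatenated format.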

The -mode option specifies the format. Valid values are interleaved and concatenated.

FASTQ output files for the R1 (forward) and R2 (reverse) reads are specified by the -output1 and -output2 options.

Interleaved format
With interleaved format, FASTQ records are R1, R2, R1, R2 etc. This is supported by the -interleaved option of fastq_mergepairs, so if you want to merge the pairs you may not need to run fastq_sra_splitpairs first.
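The idea of splitting interleaved records can be sketched as follows. This is a minimal illustration under stated assumptions, not the usearch implementation: the function name is hypothetical, and records are assumed to be already parsed into (header, sequence, quality) tuples in strict R1, R2 order.

```python
def split_interleaved(records):
    """Split interleaved records into (R1 list, R2 list).

    Assumption: records alternate strictly R1, R2, R1, R2, so
    even-indexed records are R1 and odd-indexed records are R2.
    """
    if len(records) % 2 != 0:
        raise ValueError("odd number of records; input is not cleanly interleaved")
    return records[0::2], records[1::2]

# Made-up records for illustration only.
records = [
    ("@pair1", "ACGT", "IIII"),  # R1 of pair 1
    ("@pair1", "TTGA", "IIII"),  # R2 of pair 1
    ("@pair2", "GGCA", "IIII"),  # R1 of pair 2
    ("@pair2", "CATG", "IIII"),  # R2 of pair 2
]
r1, r2 = split_interleaved(records)
```

The sketch does not check that the labels of each R1 and R2 actually match; a careful script would verify that before writing the two output files.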

Concatenated format
With concatenated format, the R1 and R2 sequences are combined into a single sequence with R2 immediately following R1 (as opposed to merged or assembled). There is no spacer or filler sequence separating the reads; they are simply concatenated. Sometimes, the reads are truncated by quality filtering before they are concatenated, in which case they are pretty much useless because it is impossible to recover the original R1s and R2s. If some or all of the reads are full-length, then the R1s and R2s can be recovered by splitting the sequence (and quality scores) at the half-way point.

With concatenated format, the read length must be specified by the -readlength option, e.g. 250. If the concatenated sequence is exactly twice the read length (e.g. 500 for -readlength 250), then it is split at the midpoint; otherwise it is discarded because it is impossible to determine where the R1 ends and the R2 begins.
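The split-or-discard rule described above can be sketched like this. It is a minimal illustration rather than the actual usearch code: the function name is hypothetical, and the sketch leaves the R2 orientation exactly as it appears in the concatenated record.

```python
def split_concatenated(seq, qual, readlength):
    """Split one concatenated record at the midpoint.

    Returns ((r1_seq, r1_qual), (r2_seq, r2_qual)), or None if the record
    is not exactly 2 x readlength and so must be discarded.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if len(seq) != 2 * readlength:
        return None  # cannot tell where R1 ends and R2 begins
    r1 = (seq[:readlength], qual[:readlength])
    r2 = (seq[readlength:], qual[readlength:])
    return r1, r2

# Made-up record with readlength 4 for illustration only.
full = split_concatenated("ACGTTTGA", "IIIIHHHH", 4)
short = split_concatenated("ACGTTTG", "IIIIHHH", 4)
```

Here full holds the two halves with their matching quality strings, and short is None because the record was truncated before concatenation.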

Example

usearch -fastq_sra_splitpairs SRA457665.fastq -mode concatenated \
-readlength 250 -output1 fwd.fq -output2 rev.fq