fastq_mergepairs command

New reporting options in v8.1.1859: -report and -tabbedout.

Performs merging of paired reads. (This is sometimes called 'assembly' of paired reads, but I find that term confusing because assembly usually refers to making longer contigs, so I prefer to call it merging.)

Typical usage:

usearch -fastq_mergepairs *_R1_*.fastq -fastqout merged.fq -relabel @

The -fastq_merge_maxee option can be used to set an expected errors threshold, but for OTU clustering, quality filtering is generally performed as a post-processing step:

usearch -fastq_filter merged.fq -fastq_maxee 1.0 -fastqout filtered.fq

This is because filtered reads are used to construct OTUs but unfiltered reads are used to construct the OTU table, so merged reads before and after filtering are both needed. See UPARSE tutorials for some examples.
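The expected errors threshold can be illustrated with a short Python sketch. It is not USEARCH code: it assumes Phred+33 quality encoding and uses the standard definition of expected errors as the sum of per-base error probabilities, P = 10^(-Q/10). The function names are hypothetical.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities for one FASTQ quality string.

    Assumes Phred+33 encoding (offset=33), so Q = ord(char) - 33.
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def passes_maxee(quality: str, max_ee: float = 1.0) -> bool:
    """Keep a read unless its expected errors exceed max_ee."""
    return expected_errors(quality) <= max_ee


if __name__ == "__main__":
    good = "I" * 100   # 'I' is Q40 in Phred+33
    bad = "#" * 10     # '#' is Q2 in Phred+33
    print(expected_errors(good), passes_maxee(good))
    print(expected_errors(bad), passes_maxee(bad))
```

A read of 100 bases at Q40 has about 0.01 expected errors and passes a threshold of 1.0, while a run of Q2 bases fails quickly.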

Merged reads are written to -fastqout (for FASTQ) and/or -fastaout (for FASTA). Reads that fail to merge are written to -fastqout_notmerged_fwd, -fastqout_notmerged_rev, -fastaout_notmerged_fwd and -fastaout_notmerged_rev.

The -eetabbedout output file is a tabbed text file which reports the expected errors for each merged read pair. The -tabbedout option (v8.1.1859) gives much more information about each pair, so -eetabbedout is deprecated.

The -report filename option writes summary information. An example report shows several anomalously short pairs with merged lengths in the range 20-30, much shorter than the mean (330nt), which suggests using -fastq_minmergelen to filter them out.

Several forward FASTQ filenames may be given following the -fastq_mergepairs option (v.8.1.1800 and later). This allows you to use shell wildcards to merge several pairs of files in a single step. If you use this feature, you will typically want to use the -relabel @ feature (see below) to label the merged reads with the sample name.

The FASTQ filename for the forward reads (R1s) is specified by the -fastq_mergepairs option, and the reverse read filename (R2s) is specified by the -reverse option. If the -reverse option is not given, the reverse read filename is constructed by replacing _R1 with _R2 in the forward filename (supported in v.8.1.1800 and later).
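The filename rule can be sketched in Python as follows. This only illustrates the substitution described above and is not USEARCH's own code. The function name is hypothetical, and replacing only the first occurrence of _R1 is an assumption, since the text does not say what happens when _R1 appears more than once.

```python
def reverse_filename(forward: str) -> str:
    """Derive the R2 filename from the R1 filename.

    Assumption: only the first occurrence of "_R1" is replaced.
    """
    if "_R1" not in forward:
        raise ValueError(f"no _R1 in forward filename: {forward!r}")
    return forward.replace("_R1", "_R2", 1)


if __name__ == "__main__":
    print(reverse_filename("Mock_S188_L001_R1_001.fastq"))
```

For Mock_S188_L001_R1_001.fastq this gives Mock_S188_L001_R2_001.fastq.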

The -relabel string option specifies that the read labels should be changed in the output files. Labels are made by appending an integer 1, 2, 3... to the string. Only reads that are successfully merged are counted, so there are no gaps in the numbering. The special value @ indicates that the string should be constructed from the file name by truncating the file name at the first underscore or period and appending a period (supported in v.8.1.1800 and later). With a typical Illumina FASTQ file name, this gives the sample name. So, for example, if the R1 file name is Mock_S188_L001_R1_001.fastq, then the string is Mock and the output labels will be Mock.1, Mock.2 etc.
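A minimal Python sketch of the -relabel @ rule, for illustration only. The function names are hypothetical, and stripping any directory part before truncating is an assumption not stated above.

```python
import os
import re


def relabel_prefix(path: str) -> str:
    """Build the label prefix used by '-relabel @' from an R1 file name.

    Truncates the file name at the first underscore or period and
    appends a period. Assumption: the directory part is ignored.
    """
    name = os.path.basename(path)
    stem = re.split(r"[_.]", name, maxsplit=1)[0]
    return stem + "."


def merged_labels(prefix: str, n_merged: int) -> list[str]:
    """Labels for the merged reads: prefix + 1, 2, 3... with no gaps."""
    return [f"{prefix}{i}" for i in range(1, n_merged + 1)]


if __name__ == "__main__":
    prefix = relabel_prefix("Mock_S188_L001_R1_001.fastq")
    print(merged_labels(prefix, 3))
```

Because only merged pairs are counted, the integers run consecutively even when some pairs in the input fail to merge.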

The -sample string option specifies that sample=string; should be added to the read label (supported in v.8.1.1800 and later).

Forward and reverse reads must be in 1:1 correspondence and must appear in the same order in both files. The labels for the forward and reverse read in a given pair must be identical, or identical except for a single position where a '1' appears in the forward read label and a '2' appears in the reverse read label.
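The label rule can be checked before merging with a sketch like the one below. This is a hypothetical helper, not part of USEARCH, and it assumes the comparison is made on the complete label line.

```python
def labels_match(fwd: str, rev: str) -> bool:
    """True if two read labels satisfy the pairing rule.

    Labels must be identical, or differ at exactly one position
    where the forward label has '1' and the reverse label has '2'.
    """
    if fwd == rev:
        return True
    if len(fwd) != len(rev):
        return False
    diffs = [(a, b) for a, b in zip(fwd, rev) if a != b]
    return diffs == [("1", "2")]


if __name__ == "__main__":
    print(labels_match("M01:7:000-A1 1:N:0:1", "M01:7:000-A1 2:N:0:1"))
    print(labels_match("read7/1", "read8/1"))
```

Running this over both files in parallel also confirms the 1:1 ordering, since a pair that is out of order will usually fail the label test.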

Option		Description
fastq_minovlen k		Minimum length of the overlap. Default: no minimum. Note: overlaps shorter than the -minhsp option will fail to align, so this option also has the effect of imposing a minimum overlap. You should therefore set -minhsp to the shortest overlap that you expect in your data. Values less than 8 may cause performance problems, and may cause spurious overlaps.
fastq_minmergelen L		Minimum length of the merged read. Default: no minimum.
fastq_maxmergelen L		Maximum length of the merged read. Default: no maximum.
fastq_maxdiffs n		Maximum number of mismatches allowed in the overlap region. Default: 5 (v8.1.1856 or later; earlier versions did not set a maximum by default).
fastq_maxdiffpct n		Maximum fraction of mismatches allowed in the overlap region, expressed as an integer percentage. Default: 5 (v8.1.1856 or later).
fastq_maxgaps n		(v8.0.1616 or later). Maximum number of gaps allowed in the alignment of the overlapping region. Default is 0, because gaps are very rare with Illumina paired reads so gaps are more likely to indicate a misalignment than a read error. Also, the merge is ambiguous in a gapped column: should the base be included or not? There is no Q score to decide which read is better in this case.
fastq_merge_maxee e		Discard pairs if the number of expected errors is > e after merging. By default, no expected error filtering is performed. (Requires v8.0.1610 or later. With earlier versions, you can use fastq_filter to perform e.e. filtering on the merged reads).
fastq_trunctail q		Truncates the forward and reverse reads at the first position with Q<=q, if present. This truncation is performed before aligning the pair. This option is provided for older Illumina reads where Q=2 ("#") was used to indicate a bad tail; it is not recommended for newer reads. For such older reads, it is recommended to use -fastq_trunctail 2 or higher, as low-quality tails often caused alignments to fail. Default: no quality truncation, as this is not needed with newer reads.
fastq_minlen L		Minimum length of the forward and reverse read, after truncating per -fastq_trunctail if applicable. Default: no minimum.
fastq_nostagger		Do not merge a pair where the alignment is "staggered" like this, with the forward read on the first row and the reverse read on the second: --FORWARD / REVERSE--. Staggered alignments are generated when the template sequence is shorter than the read length, causing the read to extend into the opposite sequencing primer and adapter. See read layouts. By default, pairs with staggered alignments are merged and trimmed. Trimming removes letters in "overhangs" that align to terminal gaps; in the above example, RE would be trimmed from the reverse read and RD would be trimmed from the forward read.
fastq_eeout		Append "ee=xxx;" annotation to the read labels giving the expected errors after merging. (Requires v8.0.1610 or later).
fastqout_notmerged_fwd fastqout_notmerged_rev fastaout_notmerged_fwd fastaout_notmerged_rev		Filenames for forward and reverse reads that are not merged (FASTQ or FASTA format).
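The mismatch limits in the table can be illustrated with a small Python sketch. It is not USEARCH's implementation: the function name is hypothetical, the two inputs are assumed to be the already-aligned, ungapped overlap segments (reverse read reverse-complemented), and it assumes a pair is accepted only if both limits hold.

```python
def overlap_passes(fwd_overlap: str, rev_overlap: str,
                   max_diffs: int = 5, max_diff_pct: int = 5) -> bool:
    """Check an ungapped overlap against maxdiffs and maxdiffpct limits.

    Assumptions: both strings cover the same aligned columns, and a
    pair passes only when neither limit is exceeded.
    """
    if len(fwd_overlap) != len(rev_overlap):
        raise ValueError("overlap segments must have equal length")
    if not fwd_overlap:
        return False  # no overlap, nothing to merge
    diffs = sum(a != b for a, b in zip(fwd_overlap, rev_overlap))
    pct = 100.0 * diffs / len(fwd_overlap)
    return diffs <= max_diffs and pct <= max_diff_pct


if __name__ == "__main__":
    long_ok = overlap_passes("A" * 100, "C" * 5 + "A" * 95)
    short_bad = overlap_passes("A" * 20, "CC" + "A" * 18)
    print(long_ok, short_bad)
```

With the defaults of 5 and 5, the percentage limit is the stricter one for overlaps shorter than 100 bases, and the absolute limit is the stricter one for longer overlaps.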