FASTQ format options
Paper on merging and filtering (Edgar & Flyvbjerg, 2015)
Paired read assembler and quality filtering benchmark
New reporting options in v8.1.1859: -report and -tabbedout
Performs merging of paired reads.
(This is sometimes called 'assembly' of paired reads, but I find this term
confusing because assembly usually refers to making longer contigs, so I prefer to call it
merging.)
usearch -fastq_mergepairs *_R1_*.fastq
-fastqout merged.fq -relabel @
The -fastq_merge_maxee option can be used to set an
expected errors threshold, but for OTU clustering
quality filtering is generally performed as a post-processing step:
usearch -fastq_filter merged.fq -fastq_maxee 1.0 -fastqout
This is because filtered reads are used to construct OTUs but
unfiltered reads are used to construct the OTU
table, so merged reads before and after filtering are both needed. See
UPARSE tutorials for some examples.
Merged reads are written to -fastqout (for FASTQ) and / or -fastaout (for FASTA).
Reads which failed to merge are written to -fastqout_notmerged_fwd and the
corresponding options for reverse reads and FASTA output.
The -eetabbedout file is a tabbed text file which reports the expected errors
for each merged read pair. The -tabbedout option (v8.1.1859) gives much more
information about each pair, so -eetabbedout is deprecated.
The -report filename option gives summary
information; see the example report.
This example shows several anomalously short pairs with merged
lengths in the range 20-30, much shorter than the mean (330nt), which suggests
using -fastq_minmergelen to filter them out.
Several forward FASTQ filenames may be given following the
-fastq_mergepairs option (v.8.1.1800 and later). This allows you to use shell
wildcards to merge several pairs of files in a single step. If you use this
feature, you will typically want to use the -relabel @ feature (see below) to
label the merged reads with the sample name.
The filename for the forward reads (R1s) is specified by the -fastq_mergepairs option, and
the reverse read filename (R2s) is specified by the -reverse option. If the -reverse option is not given, the reverse read
filename is constructed by replacing _R1 with _R2 in the forward filename (supported in v.8.1.1800 and
later).
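The default naming rule for the reverse file can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not USEARCH's own code: the function name reverse_filename is hypothetical, and replacing the last occurrence of _R1 (rather than the first or all occurrences) is an assumption, since the text does not say which is used.

```python
def reverse_filename(forward_name: str) -> str:
    """Guess the R2 file name from an R1 file name by swapping _R1 for _R2.

    Assumption: only the LAST "_R1" in the name is replaced, so a sample
    name that happens to contain "_R1" is left alone.
    """
    head, sep, tail = forward_name.rpartition("_R1")
    if not sep:
        # No "_R1" anywhere in the name: the rule cannot be applied.
        raise ValueError(f"no _R1 in forward file name: {forward_name!r}")
    return head + "_R2" + tail


if __name__ == "__main__":
    print(reverse_filename("Mock_S188_L001_R1_001.fastq"))
```

If your file names do not follow the _R1 / _R2 convention, giving -reverse explicitly avoids relying on this rule.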
The -relabel string option specifies that the
read labels should be changed in the output files. Labels are made by appending
an integer 1, 2, 3... to the string. Only reads that are successfully merged are
counted, so there are no gaps in the numbering. The special value @ indicates
that the string should be constructed from the file name by truncating the file
name at the first underscore or period and appending a period (supported in
v.8.1.1800 and later). With a typical Illumina FASTQ file name, this gives the
sample name. So, for example, if the R1 file name is
Mock_S188_L001_R1_001.fastq, then the string is Mock and the output labels will
be Mock.1, Mock.2 etc.
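As a rough illustration of how -relabel @ derives labels, the following Python sketch reproduces the rule described above. The function names are hypothetical, and stripping any directory part of the path first is an assumption that the text does not address.

```python
import os
import re


def relabel_prefix(file_name: str) -> str:
    """Build the label prefix: cut the file name at the first '_' or '.',
    then append a period."""
    base = os.path.basename(file_name)  # assumption: directories are ignored
    return re.split(r"[_.]", base, maxsplit=1)[0] + "."


def relabel(prefix: str, n_merged: int) -> list:
    """Number the merged reads 1, 2, 3... with no gaps."""
    return [f"{prefix}{i}" for i in range(1, n_merged + 1)]


if __name__ == "__main__":
    prefix = relabel_prefix("Mock_S188_L001_R1_001.fastq")
    print(relabel(prefix, 3))
```

Because the name is cut at the first underscore or period, a sample name that itself contains either character would be shortened by this rule.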
The -sample string option specifies that sample=string;
should be added to the read label (supported in v.8.1.1800 and later).
Forward and reverse reads must be in 1:1 correspondence and
must appear in the same order in both files. The labels for the forward and
reverse read in a given pair must be identical, or identical except for a single position
where a '1' appears in the forward read label and a '2' appears in the reverse
read label.
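The label rule can be checked before merging with a small helper like the one below. This is a sketch of the rule as stated here, with a hypothetical function name; it is not taken from USEARCH.

```python
def labels_pair_up(fwd_label: str, rev_label: str) -> bool:
    """Return True if two read labels are acceptable as a pair: identical,
    or differing at exactly one position with '1' forward and '2' reverse."""
    if fwd_label == rev_label:
        return True
    if len(fwd_label) != len(rev_label):
        return False
    # Collect every column where the two labels disagree.
    diffs = [(f, r) for f, r in zip(fwd_label, rev_label) if f != r]
    return diffs == [("1", "2")]


if __name__ == "__main__":
    print(labels_pair_up("read7/1", "read7/2"))
    print(labels_pair_up("read7/1", "read8/1"))
```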
Minimum length of the overlap. Default: no minimum.
Note: overlaps shorter than the -minhsp
option will fail to align, so this option also has the effect of imposing a
minimum overlap. You should therefore set -minhsp to the shortest overlap that
you expect in your data. Values less than 8 may cause performance problems, and
may cause spurious overlaps.
fastq_minmergelen sets the minimum length of the merged read. Default: no minimum.
Maximum length of the merged read. Default: no maximum.
fastq_maxdiffs sets the maximum number of mismatches allowed in the overlap
region. Default: 5 (v8.1.1856 or later; earlier versions did not set a maximum).
fastq_maxdiffpct sets the maximum fraction of mismatches
allowed in the overlap region, expressed as an integer percentage. Default 5
(v8.1.1856 or later).
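To make the two mismatch thresholds concrete, here is a minimal Python sketch that applies both limits to an ungapped overlap. Several things are assumptions for illustration: the function name, that the reverse read's overlap has already been reverse-complemented, that both limits must hold together, and that the percentage is compared without rounding. The text does not specify how USEARCH combines or rounds them.

```python
def overlap_ok(fwd_overlap: str, rev_overlap: str,
               maxdiffs: int = 5, maxdiffpct: int = 5) -> bool:
    """Check an ungapped overlap against both mismatch limits.

    rev_overlap is assumed to be reverse-complemented already, so the two
    strings can be compared column by column.
    """
    if not fwd_overlap or len(fwd_overlap) != len(rev_overlap):
        raise ValueError("overlap segments must be non-empty and equal in length")
    diffs = sum(1 for a, b in zip(fwd_overlap, rev_overlap) if a != b)
    pct = 100.0 * diffs / len(fwd_overlap)
    return diffs <= maxdiffs and pct <= maxdiffpct


if __name__ == "__main__":
    # Two mismatches in a 20 nt overlap is 10%: under the count limit,
    # but over the percentage limit.
    print(overlap_ok("A" * 20, "CC" + "A" * 18))
```

The example shows why a percentage limit matters for short overlaps, where a handful of mismatches is a large fraction of the aligned region.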
(v8.0.1616 or later). Maximum number of gaps allowed in the alignment of the
overlapping region. Default is 0, because gaps are very rare with Illumina
paired reads so gaps are more likely to indicate a misalignment than a read
error. Also, the merge is ambiguous in a gapped column: should the base be
included or not? There is no Q score to decide which read is better in this
case.
fastq_merge_maxee e discards pairs if the number of
expected errors is > e after merging. By default, no expected error
filtering is performed. (Requires v8.0.1610 or later. With earlier versions, you
can use fastq_filter to perform expected error
filtering on the merged reads.)
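For readers unfamiliar with expected errors, the quantity is the sum of the per-base error probabilities implied by the Q scores, with P = 10^(-Q/10). The Python sketch below computes it from a FASTQ quality string and applies the "discard if > e" rule. The function names are hypothetical and the Phred+33 ASCII offset is an assumption about the input encoding.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum the error probabilities 10^(-Q/10) over all bases in a read.

    offset=33 assumes Phred+33 encoded quality characters.
    """
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quality)


def passes_maxee(quality: str, maxee: float) -> bool:
    """A read is discarded when its expected errors exceed maxee."""
    return expected_errors(quality) <= maxee


if __name__ == "__main__":
    # 'I' encodes Q40 (P = 0.0001) and '+' encodes Q10 (P = 0.1) in Phred+33.
    print(expected_errors("IIII"))
    print(passes_maxee("+" * 11, 1.0))
```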
This option is provided for older Illumina data with "#"
tails, where Q=2 was used to indicate a bad tail; it is not recommended for newer reads.
It truncates the forward and reverse reads at the first Q<=q,
if present. This truncation is performed before aligning the pair.
For such older reads, it is recommended to use -fastq_truntail 2 or higher, as
low-quality tails often caused alignments to fail. Default: no quality
truncation, as this is not needed with newer reads.
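The truncation step can be pictured with the short Python sketch below. It is an illustration only: the function name is hypothetical, Phred+33 encoding is assumed, and dropping the first low-quality base itself (rather than keeping it) is an assumption the text does not settle.

```python
def truncate_at_qual(seq: str, qual: str, q: int, offset: int = 33):
    """Cut a read at the first position whose Q score is <= q.

    Returns the (sequence, quality) pair; the read is returned unchanged
    if no position has Q <= q.
    """
    for i, c in enumerate(qual):
        if ord(c) - offset <= q:
            # Assumption: the low-quality base and everything after it go.
            return seq[:i], qual[:i]
    return seq, qual


if __name__ == "__main__":
    # '#' encodes Q2 in Phred+33, the old Illumina bad-tail marker.
    print(truncate_at_qual("ACGTACGT", "IIII#III", 2))
```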
Minimum length of the forward and reverse read, after
truncating per -fastq_truncqual if applicable. Default: no minimum.
Do not merge a pair where the alignment is "staggered"
Staggered alignments are generated when the template
sequence is shorter than the read length, causing the read to extend into the
opposite sequencing primer and adapter. See read
By default, pairs with staggered alignments are
merged and trimmed. Trimming removes letters in "overhangs" that align to
terminal gaps; in the above example RE would be trimmed from the reverse read
and RD would be trimmed from the forward read.
Append "ee=xxx;" annotation to the read labels giving the
expected errors after merging. (Requires v8.0.1610
or later.)
Filenames for forward and reverse reads that are not
merged (FASTQ or FASTA format).