Using the tabbedout file to investigate merging problems

If the merge report shows that many reads are failing to merge for a given reason, then you can use the tabbedout file to investigate further. For example, suppose the report says that 70% of the pairs were discarded because of "too many diffs", i.e. mismatches in the alignments.

The simplest way to investigate is to use the -fastqout_notmerged_fwd and -fastqout_notmerged_rev options to get the pairs which did not merge, then (if needed) use fastx_subsample to get a small subset for manual investigation. See trouble-shooting merging for details.

If reads are failing to merge for two or more different reasons, then you can use the tabbedout file to get the subset of reads that is failing for one of those reasons, which may be convenient for further analysis in challenging cases.

The format of the tabbedout file is not documented in detail (and is subject to change in future usearch builds), but is fairly self-explanatory. Each read pair is one line in the file. The read label is the first field (truncated at the first space). Subsequent fields are separated by tabs. Each field reports the results of one step in the merging process, for example:

M00967:15:000000000-A2G1J:1:1101:18083:3926	aln=123-128-121	diffs=15	toomanydiffs	result=notmerged

This shows that the pair failed to merge because there were too many (15) mismatches in the alignment. To get the read labels for all the reads that failed to merge for this reason, you can do this:

grep toomanydiffs tabbedout.txt | cut -f1 > toomanydiffs.labels
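When pairs are failing for more than one reason, it can help to tally the reasons before deciding which subset to extract. The following is a minimal sketch, not part of usearch: the function name count_notmerged_reasons is hypothetical, and it assumes that a failure reason appears as a bare tab-separated field with no "=" (like toomanydiffs in the example line above) on a line that also contains result=notmerged. Since the tabbedout format is not documented in detail, check these assumptions against your own file first.

```shell
# Tally the bare fields (those without "=") on lines for pairs that did
# not merge. Assumes failure reasons are reported as bare tab-separated
# fields such as "toomanydiffs"; this may differ between usearch builds.
count_notmerged_reasons() {
    awk -F'\t' '
        /result=notmerged/ {
            # Field 1 is the read label, so start at field 2.
            for (i = 2; i <= NF; i++)
                if ($i != "" && index($i, "=") == 0)
                    n[$i]++
        }
        END {
            for (r in n)
                printf "%d\t%s\n", n[r], r
        }
    ' "$1" | sort -k1,1nr
}
```

For example, count_notmerged_reasons tabbedout.txt prints one line per reason, with the count first and the most frequent reason at the top.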

Then use the labels in toomanydiffs.labels to get the reads:

usearch -fastx_getseqs myreads_R1.fastq -labels toomanydiffs.labels -trunclabels -fastqout fwd.fq
usearch -fastx_getseqs myreads_R2.fastq -labels toomanydiffs.labels -trunclabels -fastqout rev.fq

The -trunclabels option is needed with typical Illumina reads. Without it, the labels will fail to match because of the suffixes 1:N:0... and 2:N:0... which are added to the labels of the R1 and R2 reads, respectively.
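Because the forward and reverse reads are extracted by two separate commands, it may be worth confirming that fwd.fq and rev.fq contain the same labels in the same order before using them as a paired test set. The sketch below is one way to do that for a small subset. The function names fastq_labels and check_pairs are hypothetical, and the code assumes unwrapped four-line FASTQ records with Illumina-style labels in which the R1/R2 annotation follows the first space.

```shell
# Print the label of each FASTQ record: the header line without the
# leading "@", truncated at the first space.
# Assumes each record is exactly four lines (no wrapped sequences).
fastq_labels() {
    awk 'NR % 4 == 1 { sub(/^@/, ""); sub(/ .*/, ""); print }' "$1"
}

# Succeed only if both files list the same labels in the same order.
# Both label lists are held in memory, so this is intended for small
# test sets rather than full runs.
check_pairs() {
    [ "$(fastq_labels "$1")" = "$(fastq_labels "$2")" ]
}
```

A typical call would be check_pairs fwd.fq rev.fq && echo "pairs match". If the check fails, comparing the output of fastq_labels for the two files shows where they diverge.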

Now you have a test set of read pairs which you can use to investigate further.