USEARCH manual

fastx_uniques command

Finds the set of unique sequences in an input file, a process also called dereplication. The input is a FASTA or FASTQ file. Sequences are compared letter by letter and must be identical over the full length of both sequences, so a sequence that is a substring of another does not match it. Case is ignored, so an upper-case letter matches the corresponding lower-case letter.
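
The matching rule can be illustrated with a short Python sketch. This is not how USEARCH is implemented; the function name dereplicate and the in-memory list of (label, sequence) pairs are assumptions made for the illustration, as is keeping the first-seen spelling of each sequence.

```python
def dereplicate(records):
    """Group (label, sequence) pairs by exact, full-length, case-insensitive identity.

    Returns a list of (first_label, first_sequence, count) tuples in order of
    first occurrence. Hypothetical helper, not part of USEARCH.
    """
    groups = {}  # dicts keep insertion order in Python 3.7+
    for label, seq in records:
        key = seq.upper()  # case is ignored when comparing
        if key in groups:
            groups[key][2] += 1
        else:
            groups[key] = [label, seq, 1]
    return [tuple(g) for g in groups.values()]


records = [
    ("r1", "ACGT"),
    ("r2", "acgt"),  # identical to r1 once case is ignored
    ("r3", "ACG"),   # substring of r1, so it is a separate unique sequence
    ("r4", "ACGT"),
]
uniques = dereplicate(records)
```

Here r1, r2 and r4 collapse into one unique sequence with abundance 3, while r3 stays separate because it is shorter.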

All 26 letters of the English alphabet are treated in the same way, so there is no concept of a biological alphabet or of wildcard matching (unless -strand both is used).

Multithreading is supported.

The -fastaout option specifies a FASTA output file for the unique sequences. Sequences are sorted by decreasing abundance.

The -fastqout option specifies a FASTQ output file for the unique sequences. Sequences are sorted by decreasing abundance.
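
As a rough illustration of the output order, the sketch below sorts (sequence, abundance) pairs by decreasing abundance. The function name sort_by_abundance is hypothetical, and keeping input order for ties is an assumption; the manual does not state how ties are broken.

```python
def sort_by_abundance(uniques):
    """Return (sequence, abundance) pairs ordered by decreasing abundance.

    sorted() is stable, so pairs with equal abundance keep their input order.
    That tie-breaking rule is an assumption of this sketch.
    """
    return sorted(uniques, key=lambda u: u[1], reverse=True)


uniques = [("ACG", 1), ("ACGT", 3), ("TTTT", 1), ("GGCC", 2)]
ordered = sort_by_abundance(uniques)
```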

The -uc option is supported for writing a UC-format output file; other standard output file options are not supported.

The -sizeout option specifies that size annotations should be added to the output sequence labels.
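
A minimal sketch of what a size annotation looks like on a label. The ;size=N; form used here follows the usual USEARCH label convention, but the exact punctuation is an assumption and may differ between versions; the helper name add_size is hypothetical.

```python
def add_size(label, count):
    """Append a size annotation to a sequence label.

    Assumes the ';size=N;' form; a trailing semicolon already on the label
    is dropped first so the result does not contain ';;'.
    """
    return f"{label.rstrip(';')};size={count};"


annotated = add_size("read1", 3)
```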

The -relabel option specifies a string that is used to re-label the dereplicated sequences. An integer is appended to the label.
E.g., -relabel Uniq will generate sequence labels Uniq1, Uniq2 ... etc. By default, the label of the first occurrence of the sequence is used.
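
The labelling scheme can be sketched as follows. The function name relabel is hypothetical, and numbering the sequences in output order starting from 1 is inferred from the Uniq1, Uniq2 example rather than stated explicitly.

```python
def relabel(n_uniques, prefix):
    """Generate labels prefix1, prefix2, ... for n_uniques output sequences."""
    return [f"{prefix}{i}" for i in range(1, n_uniques + 1)]


labels = relabel(3, "Uniq")
```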

The -minuniquesize option sets a minimum abundance. Unique sequences with an abundance below this value are not written to the output.

The -topn N option specifies that only the first N sequences in order of decreasing abundance will be written to the output file.
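
The two filters can be sketched together. The function name filter_uniques and its keyword arguments are hypothetical stand-ins for the command-line options, and treating the minimum as inclusive is an assumption. Because the list is sorted by decreasing abundance, the sequences that pass the minimum form a prefix of the list, so the order in which the two filters are applied does not change the result.

```python
def filter_uniques(uniques, minuniquesize=1, topn=None):
    """Apply a minimum-abundance cutoff and a top-N limit.

    uniques is a list of (sequence, abundance) pairs in any order.
    """
    ordered = sorted(uniques, key=lambda u: u[1], reverse=True)
    kept = [u for u in ordered if u[1] >= minuniquesize]  # inclusive minimum
    if topn is not None:
        kept = kept[:topn]  # first N in order of decreasing abundance
    return kept


uniques = [("A", 5), ("B", 1), ("C", 3), ("D", 2)]
min_only = filter_uniques(uniques, minuniquesize=2)
min_and_top = filter_uniques(uniques, minuniquesize=2, topn=2)
```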

Reverse-complemented matching for nucleotide sequences is supported by specifying -strand both.
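
One common way to treat a sequence and its reverse complement as the same sequence is to compare a canonical key, taken here as the alphabetically smaller of the two orientations. The sketch below shows that idea only; it is not a description of how USEARCH implements -strand both, the helper names are hypothetical, and it complements only A, C, G and T, leaving any other letter unchanged.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement a nucleotide sequence (letters other than ACGT are kept)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def strand_both_key(seq):
    """Canonical key shared by a sequence and its reverse complement."""
    s = seq.upper()
    return min(s, revcomp(s))


same = strand_both_key("AACG") == strand_both_key("CGTT")
different = strand_both_key("AACG") == strand_both_key("AACC")
```

With this key, AACG and CGTT fall into the same group, while AACG and AACC do not.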

Example

usearch -fastx_uniques input.fasta -fastaout uniques.fasta -sizeout -relabel Uniq