The fastx_subsample command generates a random subset of the sequences in a FASTA or FASTQ file. The subset is written to the filename(s) given by the -fastaout and/or -fastqout options.
This command is useful for making fast assessments of large datasets, e.g. by analyzing a small sample of NGS reads, and for rarefaction analysis.
The size of the subset must be specified by the -sample_size or -sample_pct options, which give the number of sequences and percentage of the input sequences, respectively.
Size annotations are supported if the -sizein and -sizeout options are given.
If the -xsize option is given, any size annotations in the input sequence labels are stripped.
The -randseed option sets the seed for the random number generator, which makes the subset reproducible. The value must be an integer. By default, the seed is taken from the system clock, so the subset will generally differ each time the command is run.
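The effect of a fixed seed and of specifying the size as a percentage can be illustrated with a short Python sketch. This is not USEARCH's implementation: the subsample function, its parameter names, the rounding of the percentage to a whole number of sequences, and sampling without replacement are all assumptions made for illustration.

```python
import random

def subsample(labels, sample_size=None, sample_pct=None, seed=None):
    """Return a random subset of labels (hypothetical helper, not USEARCH code)."""
    # Exactly one of the two size arguments must be supplied.
    if (sample_size is None) == (sample_pct is None):
        raise ValueError("give exactly one of sample_size or sample_pct")
    if sample_pct is not None:
        # Assumption: the percentage is rounded to the nearest whole count.
        sample_size = round(len(labels) * sample_pct / 100)
    # A fixed integer seed makes the draw repeatable; seed=None lets Python
    # pick its own seed, so the subset changes from run to run.
    rng = random.Random(seed)
    # Assumption: sequences are drawn without replacement.
    return rng.sample(labels, min(sample_size, len(labels)))

reads = [f"read{i}" for i in range(1000)]
a = subsample(reads, sample_pct=10, seed=1)
b = subsample(reads, sample_pct=10, seed=1)
print(len(a), a == b)  # prints "100 True"
```

Running the function twice with the same seed returns the same 100 labels, which is the behavior that matters for reproducible rarefaction or benchmarking runs.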
usearch -fastx_subsample raw_reads.fastq -sample_pct 10 -randseed 1 -fastqout ten_pct.fastq
usearch -fastx_subsample derep.fa -sizein -sizeout -sample_size 10000 -fastaout ten_k.fa