How important is it to re-train for my region / length?
Why not use a larger reference database like Greengenes or SILVA?
To create your own reference data, there are three main steps: 1. create a FASTA file with utax-compatible taxonomy annotations, 2. trim the sequences to the appropriate region, and 3. train parameters on a subset without placeholder names (like sp. for species or incertae sedis).
Do you really need to train your own parameters
for 16S or ITS?
If you have 16S or ITS data for a region or read length which is not included in the pre-trained taxconfs file, there may be no need to train your own parameters. See Should I re-train for my region / length? for discussion.
Preparing a UTAX-compatible FASTA
You'll need to write some code to convert existing sequence and taxonomy data to UTAX-compatible FASTA. See utax taxonomy annotations for discussion of the format requirements.
Within reason, it is ok to have identical sequences in the reference set, and it is also ok if they have different taxonomies. However, if there are large numbers of identical sequences this can degrade performance. I therefore recommend that you dereplicate the database and assign the consensus taxonomy to each unique sequence (i.e., delete ranks that are not identical in all copies of a unique sequence). You can use the -constax option of derep_fulllength or derep_prefix to do this:
usearch -derep_prefix ref.fa -constax -fastaout ref_uniques.fa -constax_report report.txt
If you trim the database to a shorter region, then you should dereplicate after trimming (because non-identical sequences may become identical when they are trimmed).
Trimming the FASTA file
Typically, the sequences you want to classify are shorter than your reference sequences. If so, you should train UTAX parameters on sequences which have been trimmed to the relevant region. For classification, you can use full-length or trained sequences, but for training you should use trimmed sequences.
I can suggest a couple of different methods for trimming: 1. the search_pcr command using the PCR primers, and 2. the qsegout option of usearch_global, using the reads as the search database and the reference sequences as the query.
Trimming using search_pcr
If you have the PCR primer sequences, and the sequences you want to classify cover the full length of an amplicon, then you can use the ampout option of search_pcr to extract the amplified region, like this:
usearch -search_pcr ref.fa -db primers.fa -strand plus -ampout ref_trimmed.fa -pcr_strip_primers
If your reads are shorter than the amplicons so don't reach the reverse primer, then you can use the -userout file to get the coordinates of the primer-matching region in the reference sequence and write a script to extract the reference sequence segment starting at this position.
Trimming using usearch_global
An alternative which can be convenient, especially if you don't know the primer sequences, is to align reference sequences to the reads and extract the aligned segment using the the qsegout option of usearch_global. To make a search database of tractable size from the reads, you could dereplicate using the derep_fulllength command and discard singletons by using -minuniquesize 2. If the database is still too large, you could cluster at a lower identity, say 97%, using cluster_fast or set a higher abundance threshold, say -minuniquesize 10. Once you have a tractable read database, you can use usearch_global to extract the matching segment of the reference sequences, e.g.:
usearch -usearch_global ref.fa -db readdb.fa -strand both -id 0.7 -qsegout ref_trimmed.fa
Here, ref.fa is a FASTA file containing the reference sequences with taxonomy annotations and readdb.fa is the database constructed from the reads. Typically the reads will only match a fraction of the reference database, so you should use as diverse a set of reads as possible (e.g., don't use a mock community). You can extend the subset by repeating the process using the trimmed set as a query:
usearch -usearch_global ref.fa -db ref_trimmed.fa -strand both -id 0.7 -qsegout ref_trimmed2.fa
You can iterate this several times. Another approach for increasing the coverage is to extend the set of reads. For example, you can do this:
usearch -usearch_global all_reads.fa -db readdb.fa -strand plus -id 0.7 -notmatched not.fa
This gives a new set not.fa which is <70% similar to the sequences in the original readdb.fa database. Now you can dereplicate the reads in not.fa and use this as a second reference:
usearch -derep_fulllength not.fa -fastaout notd.fa -minuniquesize 2
usearch -usearch_global ref.fa -db notd.fa-strand both -id 0.7 -qsegout ref_trimmed_new.fa
You would then need to merge ref_trimmed_new.fa with ref_trimmed.fa to delete duplicated sequences, which you can do using derep_fulllength.
Exclude placeholder names
For training, sequences with taxonomic names which do not correspond to monophyletic clades should be excluded. The most common such name is "sp" for "unknown / unclassified species". These will confuse the training algorithm, which will assume that two sequences annotated as, say, s:Clostridium_sp are the same species, when in fact they could be the same species or two different species. Other common placeholders are incertae sedis and (less common) variants such as incerti ordinis and incerti subordinis.
In the RDP 16S training set, Chloroplast degrades parameters (I'm not sure if it the clades are multiphyletic or there is some other reason), so should be excluded.
Placeholder names should be excluded for training, but will not degrade classification accuracy, so for best results training should be done with placholders excluded and classification should be done with the full set.
The fastx_getseqs command can be used to obtain a subset without placeholder names. The -label_words option is used to specify a text file name containing the words to be excluded, the -label_field tax option specifies that only the "tax=..." taxonomy annotation should be matched, and the -notmatched option is used to get sequences which do not match the words. For example;
usearch -fastx_getseqs ref.fa -label_words placeholder_names.txt -label_field tax -notmatched refx.fa -matched excluded.fa
A suggested starting point for a list of placeholder names is:
Let me know if you have other suggestions. You should also look for warnings "taxonomy nodes have >1 parent" when training. If you do get it, see the log file (-log filename) to get the names. These names are probably placeholders, and either way should be excluded.
As with placeholder names, non-unique names cause problems for training. In bacteria a given species name is sometimes found in more than one genus, for example Salinispora tropica and Pseudonocardia tropica. If you are training to species level this will cause a warning " taxonomy nodes have >1 parent". You can check the log file (-log usearch.log option) to see which taxa have this problem. The way I usually fix this is to add the genus name to the species name, like this:
This is only needed during training, it doesn't matter either way if non-unique names are present in your reference database for classification.
Adding "decoy" sequences
With the standard training procedure, the prediction accuracy of the topmost levels may be degraded if there are no similar sequences to act as "decoys" (potential false positives that have detectable similarity to the training set). For example, with ITS, all training sequences belong to the Fungi domain and the only decoys are reversed sequences which have very low similarity. With 16S, all sequences are Bacteria or Archaea. A lack of decoy examples can cause a classifier to assign over-optimistic confidence values. A striking example of this is the RDP Classifier, which often assigns high confidence to false-positive domain predictions on sequences that in fact are not homologous to any SSU region.
To mitigate this problem, you can add a decoy set. This should be chosen to be as closely related as possible to the top level (or levels) in your training set. For example, if you are training 16S, then you could add 18S sequences because 18S is homologous to 16S. You should only annotate the top level(s) where you need decoys. For example, if you are adding 18S to a 16S training set, then the annotation for the 18S sequences should be tax=d:eukaryota; without lower levels such as phylum or class. Then d:eukaryota will act a decoy for d:Bacteria and d:Archaea, but parameters for lower levels will be specific to 16S.
Training on the trimmed set with
The next step is to generate a taxconfs file by training on the trimmed set without placeholder names. Use the utax_train command with the -taxconfsout option to save the parameters, for example:
usearch -utax_train refx.fa -taxconfsout user.tc -utax_splitlevels NVpcofgs -utax_trainlevels dpcofgs
See taxonomy parameter training for discussion of how to set the utax_splitlevels and utax_trainlevels options.
Building a .udb database
The final step is to build a .udb database using the makeudb_utax command. You can use trimmed or full-length sequences here; I recommend using trimmed sequences as they may give slightly better results. Placeholder names do no harm here, so you should include them to get the best possible coverage. Use the -taxconfsin option with the parameter file you generated in the previous step (training on trimmed sequences with placeholders excluded). Since you are using -taxconfsin, the makeudb_utax command does not re-train and you don't need to specify the -utax_splitlevels or -utax_trainlevels options. A typical command-line looks like this:
usearch -makeudb_utax ref.fa -taxconfsin user.tc -output ref.udb
This database is now ready for running the utax command, e.g.:
usearch -utax reads.fastq -db ref.udb
-strand both -utaxout reads.utax