Note that order and genus are not present.
Commas and semicolons are not
permitted in taxon names. They can be replaced by another punctuation
character, e.g. an underscore. Otherwise, any character other than end-of-line is
allowed, including colons. Parentheses ( ... ) are allowed but discouraged,
as they may confuse scripts that parse taxonomy annotations.
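The replacement rule can be sketched in Python as follows. This is a minimal illustration, not part of USEARCH: the function name sanitize_taxon_name and the default underscore replacement are assumptions made for this example.

```python
def sanitize_taxon_name(name, replacement="_"):
    """Return a copy of a taxon name with commas and semicolons replaced.

    Hypothetical helper for illustration; it is not a USEARCH function.
    """
    # The replacement itself must not reintroduce a forbidden character.
    if any(ch in replacement for ch in ",;\r\n"):
        raise ValueError("replacement contains a forbidden character")
    # End-of-line characters cannot appear in a name at all.
    if "\n" in name or "\r" in name:
        raise ValueError("taxon name contains an end-of-line character")
    # Colons, spaces and parentheses are left untouched: they are allowed.
    return name.replace(",", replacement).replace(";", replacement)


print(sanitize_taxon_name("Clostridium, group XI"))  # Clostridium_ group XI
```

Parentheses are deliberately left alone here, since they are discouraged rather than forbidden.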
Names which are entirely blank or empty
must not be included. A missing name is indicated by omitting the rank entirely. For
example, the Greengenes convention is to include all ranks but not specify the
name, as in:
f__Mycobacteriaceae; g__Mycobacterium; s__
(higher levels omitted for brevity). In UTAX notation, the
species would be omitted:
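As an illustration of dropping blank ranks, the sketch below converts a Greengenes-style lineage into comma-separated rank:name fields. Two things are assumed here rather than taken from this page: that UTAX fields have the form rank:name separated by commas, and that rank letters can be copied unchanged. The function name greengenes_to_utax is hypothetical.

```python
def greengenes_to_utax(lineage):
    """Convert 'f__Name; g__Name; s__' to 'f:Name,g:Name', skipping blank names.

    Hypothetical helper; assumes comma-separated rank:name output fields.
    """
    fields = []
    for field in lineage.split(";"):
        rank, sep, name = field.strip().partition("__")
        name = name.strip()
        # Skip fields without the '__' separator and ranks with no name.
        if not sep or not name:
            continue
        fields.append(f"{rank}:{name}")
    return ",".join(fields)


print(greengenes_to_utax("f__Mycobacteriaceae; g__Mycobacterium; s__"))
# f:Mycobacteriaceae,g:Mycobacterium
```

The sketch does not check the names themselves for commas or semicolons, so they should be cleaned before conversion.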
White space (blanks and tabs) is
allowed within a taxon name, but otherwise is not allowed in the taxonomy
annotation, so for example a space after a comma is not allowed. If white space is present, the -notrunclabels
option must be used when the reference database is in FASTA format. The -notrunclabels
option is not required if the database has been converted to .udb format using
the makeudb_utax command.
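A quick way to spot misplaced white space is sketched below. It assumes the annotation is a comma-separated list of rank:name fields, and it treats white space at the start or end of a field, or inside the rank, as misplaced; white space in the interior of a name is accepted. The function name misplaced_whitespace is hypothetical.

```python
def misplaced_whitespace(annotation):
    """Return the fields of an annotation that contain misplaced white space.

    Hypothetical helper; assumes comma-separated rank:name fields.
    """
    problems = []
    for field in annotation.split(","):
        rank = field.partition(":")[0]
        # Leading/trailing blanks (e.g. a space after a comma) or blanks in
        # the rank are flagged; blanks inside the name itself are allowed.
        if field != field.strip() or any(ch.isspace() for ch in rank):
            problems.append(field)
    return problems


print(misplaced_whitespace("d:Bacteria, p:Firmicutes"))  # [' p:Firmicutes']
print(misplaced_whitespace("g:Candidatus Foo"))          # []
```

Whether a trailing blank at the end of a name should count as part of the name is a judgment call; this sketch flags it.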
Non-ASCII characters (8-bit or
Unicode) should not be used.
These are sometimes found in clade names, e.g. genus Heteroepichloë in UNITE
(notice the dots over the last e). Like most command-line informatics programs,
USEARCH assumes plain old 7-bit ASCII text files and does not understand other
formats such as UTF-8 which
have ASCII as a subset. See here for a Python
script that finds non-ASCII characters in a text file. I use this script to find
names with non-ASCII characters, then edit them manually, e.g. replacing ë with e.
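For readers without access to that script, a comparable check might look like the sketch below. It is not the script referred to above; the function names are hypothetical, and the file is assumed to be readable as UTF-8 (undecodable bytes are replaced so that they are still reported). The to_ascii function is an optional alternative to manual editing that strips accents via Unicode decomposition.

```python
import unicodedata


def find_non_ascii(lines):
    """Yield (line number, column, character) for each non-ASCII character."""
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                yield line_no, col, ch


def report_non_ascii(path):
    """Print the position of every non-ASCII character in a text file."""
    # Undecodable bytes become U+FFFD, which is non-ASCII and so is reported.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_no, col, ch in find_non_ascii(handle):
            print(f"{path}:{line_no}:{col}: {ch!r}")


def to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Heteroepichloë"))  # Heteroepichloe
```

Calling report_non_ascii("refdb.fa"), where refdb.fa is a placeholder file name, lists each offending position. Note that to_ascii silently drops characters that have no ASCII decomposition (for example ø), so its output should be reviewed rather than trusted blindly.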
The complete set of clade names and
their parent-child relationships is implicitly
specified by the annotations in a FASTA file. No separate file is
required to specify other information about the taxonomy. The taxonomy
need not be fully consistent with a tree, meaning that some taxa may have more
than one parent, e.g. a given family name may appear in two different orders.
This allows the use of names such as "incertae sedis", "sp." and "unknown" which
do not correspond to true taxa. However, such names should be excluded for
training; see training on user data for
tax_stats command can be used to check that annotations are correctly