NAST alignment format
NAST (Nearest Alignment Space Termination) is a multiple
alignment format originally designed for 16S rRNA, though the approach can readily be
adapted to other genes and regions.
NAST was introduced by DeSantis et al., "NAST: a
multiple sequence alignment server for comparative analysis of 16S rRNA genes",
Nucl. Acids Res. (2006) 34 (suppl 2):
W394-W399. At least two third-party aligners,

The main idea of NAST is to create a reference multiple
alignment with a fixed number of columns that does not change as new sequences
are introduced. Columns in a NAST alignment serve as fixed reference points for
a set of homologous sequences, e.g. 16S genes. Similar ideas have been applied
to other genes, e.g.
IMGT unique numbering for immunoglobulins (PMID
12477501). Lacking a better name, I generically refer to this approach as

A new sequence can be aligned to a reference alignment
relatively easily, by identifying the closest sequence or closest few sequences.
A pair-wise alignment or small multiple alignment is then made, which can
readily be mapped back to the full reference alignment. This approach allows new
sequences to be annotated with features, e.g. hypervariable regions, using a
pre-defined map of features to column numbers. Given the very large datasets now
available for 16S, immunoglobulins and other genes and regions, some traditional
methods are computationally intractable, while a NAST alignment enables
efficient calculation of pair-wise distances, identification of chimeric
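
The mapping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NAST or USEARCH implementation: the function names, the toy sequences and the feature coordinates (columns 4 to 6) are all hypothetical, and the pairwise alignment is assumed to have been made already by some other tool. Query letters that fall opposite a gap in the pairwise reference row are simply dropped.

```python
GAP = "-"

def project_to_reference(ref_row, ref_aln, qry_aln):
    """Map a pairwise alignment onto the fixed columns of a reference row.

    ref_row: the closest reference sequence as it appears in the
             fixed-column alignment (letters and gaps).
    ref_aln, qry_aln: the two rows of a pairwise alignment of that same
             reference against the new query sequence.
    Query letters opposite a gap in ref_aln have no column and are dropped.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("pairwise rows differ in length")
    if ref_row.replace(GAP, "") != ref_aln.replace(GAP, ""):
        raise ValueError("pairwise reference does not match the reference row")
    # One query symbol (letter or gap) per reference letter, in order.
    letters = iter(q for r, q in zip(ref_aln, qry_aln) if r != GAP)
    # Gap columns of the reference row stay gaps in the query row.
    return "".join(GAP if c == GAP else next(letters) for c in ref_row)

def extract_feature(row, start, end):
    """Letters of a row inside columns [start, end), 0-based, gaps removed."""
    return row[start:end].replace(GAP, "")

def column_distance(a, b):
    """Fraction of mismatches over columns where both rows have a letter."""
    if len(a) != len(b):
        raise ValueError("rows must have the same number of columns")
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not pairs:
        raise ValueError("rows share no letter columns")
    return sum(x != y for x, y in pairs) / len(pairs)

# Toy example: an 8-column reference row and a query with one extra letter.
ref_row = "AC--GT-A"
projected = project_to_reference(ref_row, "ACG-TA", "A-GCTA")
print(projected)                                 # A---GT-A
print(extract_feature(projected, 4, 6))          # GT
print(column_distance("AC--GA-A", projected))    # 0.25
```

Because every row has the same columns, the feature lookup is a plain slice and the distance is a single pass over two strings, with no further alignment needed.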

There are two main disadvantages of
NAST-like approaches. First, novel insertions cannot be accommodated correctly
because the format has a fixed number of columns by definition. Therefore,
novel insertions must be deleted (this is the solution adopted in USEARCH), or
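misalignments must be introduced. Neither solution is entirely satisfactory:
deletion discards real sequence, while a misalignment breaks the correspondence
between columns and homologous positions. Second, some (if not most) genes and regions are simply too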
variable, making it impossible to build a reasonable multiple alignment, e.g.
the fungal Internal Transcribed Spacer (ITS) region.
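
To make the first disadvantage concrete, the following sketch lists the query letters that have no column in the reference and would therefore be deleted when the query is forced into the fixed-column format. It is a hypothetical helper written for illustration, not code from USEARCH; the function name and the toy sequences are assumptions.

```python
def novel_insertions(ref_aln, qry_aln, gap="-"):
    """List query insertions relative to the reference in a pairwise alignment.

    Returns (ref_pos, letters) tuples, where ref_pos is the number of
    reference letters that precede the insertion and letters is the run of
    query letters with no corresponding reference column.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("pairwise rows differ in length")
    found = []
    ref_pos = 0
    run = ""
    for r, q in zip(ref_aln, qry_aln):
        if r == gap:
            # Reference has no letter here, so the query letter is an insertion.
            if q != gap:
                run += q
            continue
        if run:
            found.append((ref_pos, run))
            run = ""
        ref_pos += 1
    if run:
        # Insertion at the very end of the alignment.
        found.append((ref_pos, run))
    return found

# The query carries three extra letters after reference position 3.
print(novel_insertions("ACG---TA", "ACGTTTTA"))   # [(3, 'TTT')]
```

Reporting the deleted letters alongside the fixed-column row at least makes the loss visible to downstream analysis, although it does not remove the underlying limitation.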

I believe that better results can
usually be obtained by constructing pair-wise or multiple alignments of subsets
de novo, as done for example by the
uchime_ref command (compare
ChimeraSlayer and the
command which use a NAST-based method but are slower and less accurate).
However, NAST methods can be convenient in some situations, especially where
there are existing data and annotations that rely on 16S NAST or a NAST-like
fixed column scheme such as IMGT numbering.