NAST alignment format

See also
nastout files

NAST (Nearest Alignment Space Termination) is a multiple alignment format originally designed for 16S rRNA, though the approach can readily be adapted to other genes and regions.

NAST was introduced by DeSantis et al., "NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes", Nucl. Acids Res. (2006) 34 (suppl 2): W394-W399. At least three third-party aligners are also available: PyNAST, NAST-ier and mothur.

The main idea of NAST is to create a reference multiple alignment with a fixed number of columns that does not change as new sequences are introduced. Columns in a NAST alignment serve as fixed reference points for a set of homologous sequences, e.g. 16S genes. Similar ideas have been applied to other genes, e.g. IMGT unique numbering for immunoglobulins (PMID 12477501). Lacking a better name, I generically refer to this approach as "NAST".

A new sequence can be aligned to a reference alignment relatively easily, by identifying the closest sequence or closest few sequences. A pair-wise alignment or small multiple alignment is then made, which can readily be mapped back to the full reference alignment. This approach allows new sequences to be annotated with features, e.g. hypervariable regions, using a pre-defined map of features to column numbers. Given the very large datasets now available for 16S, immunoglobulins and other genes and regions, some traditional methods are computationally intractable, while a NAST alignment enables efficient calculation of pair-wise distances, identification of chimeric sequences, etc.
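As a minimal sketch of why fixed columns are convenient, the Python below extracts an annotated region from a NAST row and computes a pair-wise distance between two rows by direct column comparison, with no further alignment step. The feature names and column ranges in FEATURES are hypothetical placeholders, not real 16S coordinates, and the distance definition (mismatches divided by columns where both rows have a letter) is one simple assumed choice among several.

```python
GAPS = "-."  # characters treated as gaps in a NAST row (assumption)

# Hypothetical feature map: name -> (first column, last column),
# 0-based and inclusive. Real maps would use the reference's own columns.
FEATURES = {"regionA": (2, 5), "regionB": (8, 10)}


def extract_feature(nast_row, feature, features=FEATURES):
    """Return the ungapped letters of nast_row that fall in a named feature."""
    start, end = features[feature]
    return "".join(c for c in nast_row[start:end + 1] if c not in GAPS)


def column_distance(row_a, row_b):
    """Fraction of mismatches over columns where both rows have a letter."""
    if len(row_a) != len(row_b):
        raise ValueError("NAST rows must have the same number of columns")
    compared = 0
    mismatches = 0
    for a, b in zip(row_a, row_b):
        if a in GAPS or b in GAPS:
            continue  # skip columns where either row has a gap
        compared += 1
        if a.upper() != b.upper():
            mismatches += 1
    return mismatches / compared if compared else float("nan")


if __name__ == "__main__":
    row_a = "AC-GTTA--CGT"
    row_b = "ACGGTAA-TCGA"
    print(extract_feature(row_a, "regionA"))
    print(column_distance(row_a, row_b))
```

Because every row has the same columns, the cost of each comparison is linear in the number of columns, which is what makes all-against-all distance calculations feasible on very large sets.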

There are two main disadvantages of NAST-like approaches. First, novel insertions cannot be accommodated correctly because the format has a fixed number of columns by definition. Therefore, novel insertions must be deleted (this is the solution adopted in USEARCH), or misalignments must be introduced. Neither solution is entirely satisfactory: deleting discards letters that are really present in the sequence, while misaligning places letters in columns where they are not homologous. Second, some (if not most) genes and regions are simply too variable, making it impossible to build a reasonable multiple alignment, e.g. the fungal Internal Transcribed Spacer (ITS) region.
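To make the first disadvantage concrete, here is a minimal sketch of mapping a pair-wise alignment onto the fixed columns of a reference row, deleting any query letters that are inserted relative to the reference. It illustrates the idea only and is not the USEARCH implementation; the function name project_onto_reference and its arguments are hypothetical.

```python
def project_onto_reference(ref_nast_row, ref_pair, query_pair, gap="-"):
    """Place query letters into the NAST columns of their aligned reference letters.

    ref_nast_row: the reference sequence as it appears in the NAST alignment.
    ref_pair, query_pair: the two rows of a pair-wise alignment of the
        ungapped reference against the query (equal length, gap = '-').
    Returns (query_nast_row, deleted_letters).
    """
    if len(ref_pair) != len(query_pair):
        raise ValueError("pair-wise alignment rows must have equal length")

    # NAST columns that hold a reference letter, in order.
    ref_columns = [i for i, c in enumerate(ref_nast_row) if c not in "-."]
    if sum(1 for c in ref_pair if c != gap) != len(ref_columns):
        raise ValueError("pair-wise reference row does not match the NAST row")

    out = [gap] * len(ref_nast_row)
    deleted = []
    k = 0  # index of the next reference letter
    for r, q in zip(ref_pair, query_pair):
        if r == gap:
            # Query letter with no reference letter: a novel insertion.
            # There is no column for it, so it is dropped.
            if q != gap:
                deleted.append(q)
            continue
        if q != gap:
            out[ref_columns[k]] = q
        k += 1
    return "".join(out), "".join(deleted)


if __name__ == "__main__":
    row, lost = project_onto_reference("AC--GT-A", "ACG-TA", "A-GCTA")
    print(row)   # query in the 8 fixed columns
    print(lost)  # letters that could not be placed
```

The output row always has the same number of columns as the reference, which is the point of the format, but the returned deleted letters show what is lost whenever the query contains an insertion the reference alignment has no column for.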

I believe that better results can usually be obtained by constructing pair-wise or multiple alignments of subsets de novo, as done for example by the uchime_ref command (compare ChimeraSlayer and the mothur Chimera.slayer command, which use a NAST-based method but are slower and less accurate). However, NAST methods can be convenient in some situations, especially where there are existing data and annotations that rely on 16S NAST or a NAST-like fixed column scheme such as IMGT numbering.