Bioinformatics. 2018 Jul 15;34(14):2490-2492.
doi: 10.1093/bioinformatics/bty121.

Parallelization of MAFFT for large-scale multiple sequence alignments

Tsukasa Nakamura et al. Bioinformatics.

Abstract

Summary: We report an update for the MAFFT multiple sequence alignment program to enable parallel calculation of large numbers of sequences. The G-INS-1 option of MAFFT was recently reported to have higher accuracy than other methods for large data, but this method has been impractical for most large-scale analyses, due to the requirement of large computational resources. We introduce a scalable variant, G-large-INS-1, which has equivalent accuracy to G-INS-1 and is applicable to 50 000 or more sequences.

Availability and implementation: This feature is available in MAFFT versions 7.355 or later at https://mafft.cbrc.jp/alignment/software/mpi.html.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
(a) QuanTest. Accuracy of protein secondary structure prediction based on various sizes of MSAs by G-large-INS-1 (red bold lines), G-INS-1 (version 7.245; blue bold lines) and other popular methods. We used 1940 (out of 2265) entries so that JPred (Drozdetskiy et al., 2015) can be consistently applied to the MSAs by all methods. (b)–(g) Parallelization efficiency of the all-to-all alignment stage (b, d and f) and the progressive stage (c, e and g) when applying G-large-INS-1 to LSU rRNA (b, c), sdr (d, e) and zf-CCHH (f, g). Green squares and magenta triangles are the computational times on the NFS and Lustre filesystems, respectively. Lines are the expected times, assuming perfect efficiency, based on the cases using seven cores [NFS; green solid lines in (b), (d) and (f)], 35 cores [Lustre; magenta dotted lines in (b), (d) and (f)] and a single core (c, e and g). The calculations with NFS (green) were performed on a heterogeneous cluster system (each node has 16–20 cores of Intel Xeon E5-2660 v3 2.6 GHz, E5-2680 2.7 GHz and E5-2670 v2 2.50 GHz with 64–128 GB RAM). The calculations with the Lustre filesystem (magenta) were performed on Intel Xeon E5-2695 v4 2.10 GHz 36 cores with 256 GB RAM per node using Lustre version 2.5.42.
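
The "expected time assuming perfect efficiency" lines in panels (b)–(g) can be read as a simple extrapolation from a baseline run: if the work scales ideally, the wall-clock time is inversely proportional to the number of cores. The following is a minimal Python sketch of that arithmetic only. The function names and the timing values are hypothetical illustrations, not taken from the paper or from MAFFT.

```python
def expected_time(base_time: float, base_cores: int, cores: int) -> float:
    """Ideal wall-clock time on `cores` cores, extrapolated from a
    baseline run that took `base_time` seconds on `base_cores` cores.
    Assumes perfect (linear) scaling."""
    if base_time < 0 or base_cores <= 0 or cores <= 0:
        raise ValueError("time must be non-negative and core counts positive")
    return base_time * base_cores / cores


def parallel_efficiency(base_time: float, base_cores: int,
                        observed_time: float, cores: int) -> float:
    """Ratio of the ideal time to the observed time; 1.0 means the
    observed run sits exactly on the perfect-efficiency line."""
    if observed_time <= 0:
        raise ValueError("observed time must be positive")
    return expected_time(base_time, base_cores, cores) / observed_time


# Hypothetical numbers for illustration only (not measurements from the paper):
# a 7-core baseline of 700 s, compared with a 70-core run observed at 100 s.
ideal = expected_time(700.0, 7, 70)
eff = parallel_efficiency(700.0, 7, 100.0, 70)
print(f"ideal time: {ideal:.1f} s, efficiency: {eff:.2f}")
```

A point lying above its expected-time line in the figure corresponds to an efficiency below 1.0 under this definition.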

References

    1. Boyce K. et al. (2015) Instability in progressive multiple sequence alignment algorithms. Algorithms Mol. Biol., 10, 26.
    2. Drozdetskiy A. et al. (2015) JPred4: a protein secondary structure prediction server. Nucleic Acids Res., 43, W389–W394.
    3. Fox G. et al. (2016) Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments. Bioinformatics, 32, 814–820.
    4. Glöckner F.O. et al. (2017) 25 years of serving the community with ribosomal RNA gene reference databases and tools. J. Biotechnol., 261, 169–176.
    5. González-Domínguez J. et al. (2016) MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems. Bioinformatics, 32, 3826–3828.
