Parallelization of MAFFT for large-scale multiple sequence alignments

Tsukasa Nakamura^{1,2}, Kazunori D Yamada^{2,3}, Kentaro Tomii^{1,2,4,5}, Kazutaka Katoh^{2,6}

Affiliations

¹ Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan.
² Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan.
³ Graduate School of Information Sciences, Tohoku University, Sendai, Japan.
⁴ Biotechnology Research Institute for Drug Discovery (BRD), AIST, Tokyo, Japan.
⁵ AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), Tokyo, Japan.
⁶ Research Institute for Microbial Diseases, Osaka University, Suita, Japan.

PMID: 29506019
PMCID: PMC6041967
DOI: 10.1093/bioinformatics/bty121

Bioinformatics. 2018 Jul 15;34(14):2490-2492.

Abstract

Summary: We report an update for the MAFFT multiple sequence alignment program to enable parallel calculation of large numbers of sequences. The G-INS-1 option of MAFFT was recently reported to have higher accuracy than other methods for large data, but this method has been impractical for most large-scale analyses, due to the requirement of large computational resources. We introduce a scalable variant, G-large-INS-1, which has equivalent accuracy to G-INS-1 and is applicable to 50 000 or more sequences.

Availability and implementation: This feature is available in MAFFT versions 7.355 or later at https://mafft.cbrc.jp/alignment/software/mpi.html.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
(a) QuanTest. Accuracy of protein secondary structure prediction based on various sizes of MSAs by G-large-INS-1 (red bold lines), G-INS-1 (version 7.245; blue bold lines) and other popular methods. We used 1940 (out of 2265) entries so that JPred (Drozdetskiy *et al.*, 2015) can be consistently applied to the MSAs by all methods. (b)–(g) Parallelization efficiency of the all-to-all alignment stage (b, d and f) and the progressive stage (c, e and g) when applying G-large-INS-1 to LSU rRNA (b, c), sdr (d, e) and zf-CCHH (f, g). Green squares and magenta triangles are the computational times on the NFS and Lustre filesystems, respectively. Lines are the expected times, assuming perfect efficiency, based on the cases using seven cores [NFS; green solid lines in (b), (d) and (f)], 35 cores [Lustre; magenta dotted lines in (b), (d) and (f)] and a single core (c, e and g). The calculations with NFS (green) were performed on a heterogeneous cluster system (each node has 16–20 cores of Intel Xeon E5-2660 v3 2.6 GHz, E5-2680 2.7 GHz and E5-2670 v2 2.50 GHz with 64–128 GB RAM). The calculations with the Lustre filesystem (magenta) were performed on Intel Xeon E5-2695 v4 2.10 GHz 36 cores with 256 GB RAM per node using Lustre version 2.5.42.
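
The "expected time" lines in panels (b)–(g) can be read as ideal strong scaling from a measured baseline: if a run takes a given time on a baseline number of cores, perfect efficiency predicts that the time shrinks in inverse proportion to the core count. The following Python sketch illustrates that arithmetic only. It is not taken from the paper or from MAFFT; the function names and the timings in the usage note are hypothetical.

```python
def expected_time(base_time: float, base_cores: int, cores: int) -> float:
    """Ideal (perfect-efficiency) run time on `cores` cores, extrapolated
    from a run that took `base_time` on `base_cores` cores.

    Assumption: perfect strong scaling, i.e. time * cores is constant.
    """
    if base_time < 0 or base_cores <= 0 or cores <= 0:
        raise ValueError("time must be non-negative and core counts positive")
    return base_time * base_cores / cores


def parallel_efficiency(base_time: float, base_cores: int,
                        cores: int, measured_time: float) -> float:
    """Ratio of the ideal time to the measured time (1.0 means perfect scaling)."""
    if measured_time <= 0:
        raise ValueError("measured_time must be positive")
    return expected_time(base_time, base_cores, cores) / measured_time
```

For example, with a hypothetical baseline of 700 s on seven cores, `expected_time(700.0, 7, 35)` gives 140 s as the ideal time on 35 cores. A measured point lying above that value corresponds to an efficiency below 1.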

References

1. Boyce K. et al. (2015) Instability in progressive multiple sequence alignment algorithms. Algorithms Mol. Biol., 10, 26.
2. Drozdetskiy A. et al. (2015) JPred4: a protein secondary structure prediction server. Nucleic Acids Res., 43, W389–W394.
3. Fox G. et al. (2016) Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments. Bioinformatics, 32, 814–820.
4. Glöckner F.O. et al. (2017) 25 years of serving the community with ribosomal RNA gene reference databases and tools. J. Biotechnol., 261, 169–176.
5. González-Domínguez J. et al. (2016) MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems. Bioinformatics, 32, 3826–3828.
