BMC Bioinformatics. 2012 Dec 5:13:324.

doi: 10.1186/1471-2105-13-324.

Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework

Steven Lewis¹, Attila Csordas, Sarah Killcoyne, Henning Hermjakob, Michael R Hoopmann, Robert L Moritz, Eric W Deutsch, John Boyle

PMID: 23216909
PMCID: PMC3538679
DOI: 10.1186/1471-2105-13-324

Steven Lewis et al. BMC Bioinformatics. 2012.

Affiliation

¹ Institute for Systems Biology, Seattle, WA, USA. steven.lewis@systemsbiology.org

Abstract

Background: In shotgun mass spectrometry-based proteomics, the most computationally expensive step is matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at a very high rate, and the scope of what is searched for is continually increasing. Solutions that improve our ability to perform these searches are therefore needed.

Results: We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed.

Conclusion: The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.

Figures

**Figure 1**
**MapReduce jobs to generate a list of peptides to score at a specified m/z ratio.** The first mapper generates all possible sequences and modified sequences defined in the search parameters for a given fasta database. The reducer eliminates duplicates, remembers all source proteins and emits the peptide with m/z as the key. The next set of reducers collects all peptides to be scored against a given m/z and stores them in the database.
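The flow described in this caption can be sketched in plain Python, with ordinary functions standing in for the Hadoop mapper and reducers. This is a minimal illustration under stated assumptions, not Hydra's implementation: it handles only fully tryptic cleavage (after K or R, not before P) with no missed cleavages and no modifications, and it uses the rounded neutral monoisotopic mass as a stand-in for the m/z key. All function names are hypothetical.

```python
from collections import defaultdict

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056


def tryptic_peptides(sequence, min_len=2):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        at_end = i == len(sequence) - 1
        cleave = aa in "KR" and not at_end and sequence[i + 1] != "P"
        if cleave or at_end:
            peptide = sequence[start:i + 1]
            if len(peptide) >= min_len:
                peptides.append(peptide)
            start = i + 1
    return peptides


def mass_key(peptide):
    """Assumed key: neutral peptide mass rounded to the nearest integer."""
    return int(round(sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS))


def map_proteins(proteins):
    """Mapper: emit (peptide, protein_id) for every peptide of every protein."""
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence):
            yield peptide, protein_id


def reduce_unique_peptides(pairs):
    """Reducer 1: drop duplicates, remember source proteins, re-key by mass."""
    sources = defaultdict(set)
    for peptide, protein_id in pairs:
        sources[peptide].add(protein_id)
    for peptide, protein_ids in sources.items():
        yield mass_key(peptide), (peptide, sorted(protein_ids))


def reduce_by_mass(keyed):
    """Reducer 2: collect all peptides that share a mass key."""
    index = defaultdict(list)
    for key, entry in keyed:
        index[key].append(entry)
    return dict(index)


def build_peptide_index(proteins):
    return reduce_by_mass(reduce_unique_peptides(map_proteins(proteins)))
```

Calling `build_peptide_index({"P1": "GAKVR", "P2": "GAKSS"})` yields one entry for the shared peptide `GAK` that lists both source proteins, which is the duplicate elimination the caption describes.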

**Figure 2**
**MapReduce jobs to score measured spectra. Spectra are scored against the contents of the peptide database with a series of m/z values.** In the next job all scores are combined to generate the best scores. As a single file is the desired output, the last job has a single reducer allowing all output to go to a single file.
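The scoring stages in this caption can be illustrated with a small in-memory sketch. It is written under several assumptions: a plain dot product over binned peak intensities stands in for the K-score algorithm, candidates are selected by an integer precursor key with a tolerance of one bin, and the peptide index already carries a theoretical spectrum for each peptide. The function names and the tab-separated output format are hypothetical and are not taken from Hydra.

```python
def dot_product(observed, theoretical):
    """Sum of intensity products over bins present in both spectra."""
    return sum(
        intensity * theoretical.get(bin_id, 0.0)
        for bin_id, intensity in observed.items()
    )


def map_score(spectrum, peptide_index, tolerance=1):
    """Mapper: score one spectrum against every candidate near its precursor key."""
    center = spectrum["precursor_key"]
    for key in range(center - tolerance, center + tolerance + 1):
        for peptide, theoretical in peptide_index.get(key, []):
            score = dot_product(spectrum["peaks"], theoretical)
            yield spectrum["id"], (peptide, score)


def reduce_best(scored):
    """Reducer: keep the best-scoring peptide for each spectrum."""
    best = {}
    for spectrum_id, (peptide, score) in scored:
        if spectrum_id not in best or score > best[spectrum_id][1]:
            best[spectrum_id] = (peptide, score)
    return best


def consolidate(best):
    """Single final reducer: gather every result into one ordered output."""
    return [
        f"{spectrum_id}\t{peptide}\t{score:.1f}"
        for spectrum_id, (peptide, score) in sorted(best.items())
    ]


def search(spectra, peptide_index):
    scored = (
        pair for spectrum in spectra for pair in map_score(spectrum, peptide_index)
    )
    return consolidate(reduce_best(scored))
```

The last step mirrors the design choice stated in the caption: funnelling everything through one reducer costs parallelism at the end of the pipeline but produces a single output file.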

**Figure 3**
**Search times as a function of job complexity.** Complexity is measured in dot products, where one dot product is the score of one spectrum against one peptide. Complexity depends on the number of spectra, the size of the protein database, and the modifications and cleavages searched. The measured spectra files used to benchmark our implementation were selected from public experiments in the PRIDE (Proteomics Identifications Database) proteomics repository. The PRIDE accession numbers of the three experiments used for Figure 3 are 7962, 15459 and 10295. The PRIDE XML files containing the spectra were downloaded from the PRIDE website and opened in PRIDE Inspector [19]. The MGF export function of PRIDE Inspector was used to generate the MGF files used in the searches; only human tissue samples or cell lines were used to generate the mass spectra.
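As a rough illustration of the complexity measure, the number of dot products can be counted as the number of (spectrum, candidate peptide) pairs. The sketch below assumes that each spectrum is scored only against peptides whose integer mass key lies within a tolerance window of its precursor key; that windowing rule and the function name are assumptions made for the example, not details given in the source.

```python
def count_dot_products(precursor_keys, peptides_per_key, tolerance=1):
    """Count (spectrum, peptide) scoring pairs for a search job.

    precursor_keys: one integer mass key per measured spectrum.
    peptides_per_key: mapping of mass key to number of database peptides.
    """
    total = 0
    for center in precursor_keys:
        for key in range(center - tolerance, center + tolerance + 1):
            total += peptides_per_key.get(key, 0)
    return total
```

Under this model, adding spectra lengthens `precursor_keys`, while a larger protein database, more modifications or looser cleavage rules raise the counts in `peptides_per_key`, so each factor named in the caption increases the total.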

**Figure 4**
**Database build time as a function of the number of peptides cataloged.** The figure plots the time to build a tryptic database against the number of peptides. The data are for a tryptic database with limited modifications. Build times are higher for semitryptic builds or with more modifications. Build times for tryptic digests range from a few minutes, largely representing setup time, to under an hour for the largest databases with over a million proteins.

References

1. Mann M, Wilm M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem. 1994;66(24):4390–4399. doi: 10.1021/ac00096a002.
2. Craig R, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004;20(9):1466–1467. doi: 10.1093/bioinformatics/bth092.
3. Eng J, McCormack A, Yates J. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1994;5(11):976–989. doi: 10.1016/1044-0305(94)80016-2.
4. Geer LY, et al. Open mass spectrometry search algorithm. J Proteome Res. 2004;3(5):958–964. doi: 10.1021/pr0499491.
5. Baumgardner L, et al. Fast parallel tandem mass spectral library searching using GPU hardware acceleration. J Proteome Res. 2011;10(6):2882–2888. doi: 10.1021/pr200074h.

Grants and funding

R01CA137442/CA/NCI NIH HHS/United States
