2012 Dec 5:13:324.
doi: 10.1186/1471-2105-13-324.

Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework

Steven Lewis et al. BMC Bioinformatics.

Abstract

Background: In shotgun mass spectrometry-based proteomics, the most computationally expensive step is matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Solutions that improve our ability to perform these searches are therefore needed.

Results: We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed.

Conclusion: The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.

Figures

Figure 1
MapReduce jobs to generate a list of peptides to score at a specified m/z ratio. The first mapper generates all possible sequences and modified sequences defined in the search parameters for a given FASTA database. The reducer eliminates duplicates, remembers all source proteins and emits the peptide with m/z as the key. The next set of reducers collects all peptides to be scored against a given m/z and stores them in the database.
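The two stages described in this caption can be sketched in plain Python as an in-memory illustration. This is not Hydra's code: the function names, the minimum peptide length, the restriction to fully tryptic peptides with no missed cleavages or modifications, and the rounding of singly charged m/z to an integer key are all assumptions made for the example.

```python
from collections import defaultdict

# Standard monoisotopic residue masses (Da); non-standard letters are not handled.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_peptides(sequence, min_len=5):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    start = 0
    for i, aa in enumerate(sequence):
        at_end = i == len(sequence) - 1
        if at_end or (aa in "KR" and sequence[i + 1] != "P"):
            peptide = sequence[start:i + 1]
            if len(peptide) >= min_len:
                yield peptide
            start = i + 1


def map_digest(protein_id, sequence):
    """First mapper: emit (peptide, source protein) pairs."""
    for peptide in tryptic_peptides(sequence):
        yield peptide, protein_id


def reduce_dedupe(peptide, protein_ids):
    """First reducer: one record per peptide, keyed by an integer m/z bin (assumed)."""
    mass = sum(MONO[aa] for aa in peptide) + WATER
    mz_bin = int(round(mass + PROTON))  # singly charged ion, rounded to a bin
    yield mz_bin, (peptide, sorted(set(protein_ids)))


def build_peptide_index(proteins):
    """Simulate both shuffle phases in memory: group by peptide, then by m/z bin."""
    by_peptide = defaultdict(list)
    for protein_id, sequence in proteins.items():
        for peptide, source in map_digest(protein_id, sequence):
            by_peptide[peptide].append(source)
    by_mz = defaultdict(list)
    for peptide, sources in by_peptide.items():
        for mz_bin, record in reduce_dedupe(peptide, sources):
            by_mz[mz_bin].append(record)
    return dict(by_mz)


index = build_peptide_index({"P1": "AAAAAKGGGGGR", "P2": "AAAAAKCCCCCK"})
```

Here the shared peptide AAAAAK is stored once with both source proteins, which mirrors the duplicate elimination the caption describes.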
Figure 2
MapReduce jobs to score measured spectra. Spectra are scored against the contents of the peptide database with a series of m/z values. In the next job all scores are combined to generate the best scores. As a single file is the desired output, the last job has a single reducer allowing all output to go to a single file.
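A rough sketch of this scoring flow, again as in-memory Python rather than Hadoop jobs. The score used here is a plain dot product over integer m/z bins as a stand-in and is not the K-score algorithm; the data layout, the function names and the `tolerance_bins` parameter are hypothetical.

```python
from collections import defaultdict


def dot_score(measured, theoretical):
    """Plain dot product over shared integer m/z bins (stand-in, not K-score)."""
    return sum(inten * theoretical.get(mz, 0.0) for mz, inten in measured.items())


def map_score(spectrum, peptide_db, tolerance_bins=1):
    """Score one spectrum against every candidate peptide in nearby m/z bins."""
    center = spectrum["precursor_bin"]
    for mz_bin in range(center - tolerance_bins, center + tolerance_bins + 1):
        for peptide, theoretical in peptide_db.get(mz_bin, []):
            yield spectrum["id"], (dot_score(spectrum["peaks"], theoretical), peptide)


def reduce_best(spectrum_id, scored):
    """Combine all scores for one spectrum and keep the best one."""
    yield spectrum_id, max(scored)


def search(spectra, peptide_db):
    grouped = defaultdict(list)
    for spectrum in spectra:
        for spectrum_id, record in map_score(spectrum, peptide_db):
            grouped[spectrum_id].append(record)
    # A single final "reducer" writes one ordered result list,
    # analogous to the single output file in the caption.
    return [next(reduce_best(sid, recs)) for sid, recs in sorted(grouped.items())]


peptide_db = {
    500: [("PEPA", {100: 1.0, 200: 1.0}), ("PEPB", {100: 1.0, 300: 1.0})],
    501: [("PEPC", {100: 1.0})],
}
spectra = [
    {"id": "s1", "precursor_bin": 500, "peaks": {100: 2.0, 300: 5.0}},
    {"id": "s2", "precursor_bin": 900, "peaks": {100: 1.0}},  # no candidates
]
results = search(spectra, peptide_db)
```

Spectra with no candidate peptides in range, such as `s2`, simply produce no output record in this sketch.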
Figure 3
Search times as a function of job complexity. Complexity is measured in dot products, where one dot product is the score of one spectrum against one peptide. Complexity depends on the number of spectra, the size of the protein database and the modifications and cleavages searched. The measured spectra files used for benchmarking our implementation were selected from the public experiments of the PRIDE (Proteomics Identifications Database) proteomics repository. The PRIDE accession numbers of the 3 experiments used for making Figure 3 are: 7962, 15459, 10295. The PRIDE XML files containing spectra were downloaded from the PRIDE website and opened in PRIDE Inspector [19]. The MGF export functionality of PRIDE Inspector was used to generate the MGF files used in the searches, with only human tissue samples or cell lines being used to generate the mass spectra.
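As a small illustration of this complexity measure, the number of dot products can be estimated by summing, for each spectrum, the number of candidate peptides it would be scored against. The integer m/z binning and the `tolerance_bins` parameter below are assumptions for the example, not values from the paper.

```python
def count_dot_products(spectrum_bins, peptides_per_bin, tolerance_bins=1):
    """Total spectrum-versus-peptide scorings for a search.

    spectrum_bins: precursor m/z bin of each measured spectrum
    peptides_per_bin: number of database peptides stored under each m/z bin
    """
    total = 0
    for center in spectrum_bins:
        for mz_bin in range(center - tolerance_bins, center + tolerance_bins + 1):
            total += peptides_per_bin.get(mz_bin, 0)
    return total


# Two spectra near bin 500 (4 candidates each) and one at bin 662 (2 candidates).
complexity = count_dot_products([500, 500, 662], {500: 3, 501: 1, 662: 2})
```

More spectra, a larger protein database, or more modifications and cleavages all raise this count, either by adding terms to the sum or by increasing the number of peptides per bin.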
Figure 4
Database build time as a function of the number of peptides cataloged. The data are for a tryptic database with limited modifications; build times are higher for semitryptic builds or with more modifications. Build times for tryptic digests range from a few minutes, largely representing setup time, to under an hour for the largest databases with over a million proteins.

References

    1. Mann M, Wilm M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem. 1994;66(24):4390–4399. doi: 10.1021/ac00096a002.
    2. Craig R, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004;20(9):1466–1467. doi: 10.1093/bioinformatics/bth092.
    3. Eng J, McCormack A, Yates J. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1994;5(11):976–989. doi: 10.1016/1044-0305(94)80016-2.
    4. Geer LY, et al. Open mass spectrometry search algorithm. J Proteome Res. 2004;3(5):958–964. doi: 10.1021/pr0499491.
    5. Baumgardner L, et al. Fast parallel tandem mass spectral library searching using GPU hardware acceleration. J Proteome Res. 2011;10(6):2882–2888. doi: 10.1021/pr200074h.
