Rapid Commun Mass Spectrom. 2025 May;39 Suppl 1(Suppl 1):e9153.
doi: 10.1002/rcm.9153. Epub 2021 Jul 20.

Large-scale tandem mass spectrum clustering using fast nearest neighbor searching

Wout Bittremieux et al. Rapid Commun Mass Spectrom. 2025 May.

Abstract

Rationale: Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data that are being generated. In this study, we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra.

Methods: falcon succeeds in efficiently clustering large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively perform all pairwise spectrum comparisons within the relevant precursor mass tolerance. Finally, density-based clustering is performed to group similar spectra into clusters.
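The four steps above can be illustrated with a minimal, standard-library-only sketch. It is not falcon's implementation. The function names, the bin width, the vector dimension, the CRC32 hash, the cosine distance, and the choice of DBSCAN as the density-based method are all assumptions made for illustration. The brute-force scan inside each precursor m/z window is a simple stand-in for the nearest neighbor index described in the abstract, which exists precisely to avoid that exhaustive comparison.

```python
import math
import zlib


def hash_spectrum(peaks, bin_width=0.05, dim=64):
    """Bin (m/z, intensity) peaks and fold the bins into a unit vector of length dim."""
    vec = [0.0] * dim
    for mz, intensity in peaks:
        bin_index = int(mz / bin_width)
        # Feature hashing: map the (potentially huge) bin index to one of dim slots.
        # CRC32 is an illustrative hash choice, not necessarily the one falcon uses.
        slot = zlib.crc32(str(bin_index).encode()) % dim
        vec[slot] += intensity
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def cosine_distance(u, v):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


def sparse_distances(precursor_mzs, vectors, mz_tol=0.05, max_dist=0.1):
    """Keep only pairs within the precursor tolerance and the distance cutoff.

    Stand-in for a nearest neighbor index: this scans every pair in the window.
    """
    order = sorted(range(len(vectors)), key=lambda i: precursor_mzs[i])
    neighbors = {}
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if precursor_mzs[j] - precursor_mzs[i] > mz_tol:
                break  # sorted by precursor m/z, so later spectra are further away
            d = cosine_distance(vectors[i], vectors[j])
            if d <= max_dist:
                neighbors.setdefault(i, {})[j] = d
                neighbors.setdefault(j, {})[i] = d
    return neighbors


def dbscan(n, neighbors, min_samples=2):
    """Minimal DBSCAN over a precomputed sparse neighbor map; -1 marks noise."""
    labels = [-1] * n
    visited = [False] * n
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors.get(i, {})) + 1 < min_samples:
            continue  # not a core point; stays noise unless a cluster claims it
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors.get(j, {})) + 1 >= min_samples:
                queue.extend(neighbors[j])  # expand the cluster through core points
        cluster += 1
    return labels


# Toy data (made up): (precursor m/z, [(fragment m/z, intensity), ...])
spectra = [
    (500.25, [(100.02, 10.0), (200.03, 5.0), (300.01, 1.0)]),
    (500.26, [(100.02, 9.0), (200.03, 6.0), (300.01, 1.0)]),
    (800.40, [(150.02, 4.0), (250.03, 8.0)]),
]
precursors = [mz for mz, _ in spectra]
vectors = [hash_spectrum(peaks) for _, peaks in spectra]
neighbors = sparse_distances(precursors, vectors)
labels = dbscan(len(spectra), neighbors)
print(labels)  # [0, 0, -1]
```

The first two toy spectra share a precursor window and nearly identical fragment peaks, so they fall into one cluster, while the third has no neighbor and is left as noise. At the scale of millions of spectra, the pairwise scan in `sparse_distances` would be the bottleneck, which is the step an approximate nearest neighbor index is meant to replace.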

Results: Several state-of-the-art spectrum clustering tools were evaluated using a large draft human proteome data set consisting of 25 million spectra, indicating that alternative tools produce clustering results with different characteristics. Notably, falcon generates larger highly pure clusters than alternative tools, leading to a larger reduction in data volume without the loss of relevant information for more efficient downstream processing.

Conclusions: falcon is a highly efficient spectrum clustering tool, publicly available as open-source software under the permissive BSD license at https://github.com/bittremieux/falcon.
