Bioinformatics. 2017 May 1;33(9):1309-1316.
doi: 10.1093/bioinformatics/btw806.

A mass graph-based approach for the identification of modified proteoforms using top-down tandem mass spectra

Qiang Kou et al. Bioinformatics.

Abstract

Motivation: Although proteomics has rapidly developed in the past decade, researchers are still in the early stage of exploring the world of complex proteoforms, which are protein products with various primary structure alterations resulting from gene mutations, alternative splicing, post-translational modifications, and other biological processes. Proteoform identification is essential to mapping proteoforms to their biological functions as well as discovering novel proteoforms and new protein functions. Top-down mass spectrometry is the method of choice for identifying complex proteoforms because it provides a 'bird's eye view' of intact proteoforms. The combinatorial explosion of various alterations on a protein may result in billions of possible proteoforms, making proteoform identification a challenging computational problem.

Results: We propose a new data structure, called the mass graph, for efficient representation of proteoforms and design mass graph alignment algorithms. We developed TopMG, a mass graph-based software tool for proteoform identification by top-down mass spectrometry. Experiments on top-down mass spectrometry datasets showed that TopMG outperformed existing methods in identifying complex proteoforms.

Availability and implementation: http://proteomics.informatics.iupui.edu/software/topmg/.

Contact: xwliu@iupui.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1
Comparison of a complex proteoform and its corresponding reference protein sequence in the database. The proteoform has an N-terminal truncation ‘MTTSE’, an amino acid mutation from ‘R’ to ‘K’, an insertion of ‘AA’, one phosphorylated serine residue, and two modified cysteine residues with carbamidomethylation.
Fig. 2
Construction of mass graphs. (a) An illustration of the construction of a proteoform mass graph from a protein ARKTDAR and four variable PTMs: acetylation on K and the first R, methylation on R and K, phosphorylation on T, and dimethylation on K. Each node corresponds to a peptide bond, or the N- or C-terminus of the protein; each edge corresponds to an amino acid residue (red edges correspond to modified amino acid residues). The weight of each edge is the mass of its corresponding unmodified or modified residue (a scaling factor of 1 is used to convert weights to integers). (b) An illustration of the construction of a spectral mass graph from a prefix residue mass spectrum 0, 156, 198, 326, 340, 425, 521, 707. The spectrum is generated from a proteoform of RKTDA with an acetylation on the R, a methylation on the K, and a phosphorylation on the T. To simplify the mass graph, masses corresponding to proteoform suffixes (C-terminal fragment masses) are not shown. The full path from the start node y0 to the end node y7 is aligned with the bold path from node x1 to node x6. The path from y0 to y6 and the red bold path from x1 to x4 are consistent.
Fig. 3
The algorithm for computing all the r-distance sets of a proteoform mass graph.
Fig. 4
The running time and percentages of correctly identified PrSMs for the 11,505 test PrSMs with 5 variable PTMs each when the parameter L is set to 10, 20, ..., 100.
Fig. 5
The percentages of correctly identified PrSMs for the test PrSMs with various numbers of variable PTMs.
Fig. 6
Histograms for the PrSMs reported from the first histone dataset by TopMG with L = 40 and MS-Align-E. (a) the number of matched fragment ions; (b) the number of variable PTM sites.
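
The proteoform mass graph described in the Fig. 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the TopMG implementation: the function names `build_mass_graph` and `path_masses`, the dictionary layout, and the use of nominal integer masses (residues A = 71, R = 156, K = 128, T = 101, D = 115; shifts acetyl = 42, methyl = 14, dimethyl = 28, phospho = 80) are assumptions made for the example. Nodes are numbered 0 to 7 to mirror x0 to x7, and parallel edges with equal weight are merged.

```python
# Nominal (integer) masses in Da, i.e. a scaling factor of 1 as in the Fig. 2 caption.
RESIDUE_MASS = {"A": 71, "R": 156, "K": 128, "T": 101, "D": 115}
PTM_SHIFT = {"acetyl": 42, "methyl": 14, "dimethyl": 28, "phospho": 80}


def build_mass_graph(protein, site_ptms):
    """Return edges[i]: sorted weights of the edges from node i to node i + 1.

    Each residue contributes one edge for its unmodified mass and one edge
    for every variable PTM allowed at that position (hypothetical layout).
    """
    edges = []
    for i, aa in enumerate(protein):
        base = RESIDUE_MASS[aa]
        weights = {base}
        for ptm in site_ptms.get(i, ()):
            weights.add(base + PTM_SHIFT[ptm])
        edges.append(sorted(weights))
    return edges


def path_masses(edges, start, end):
    """Return the set of total weights over all paths from node start to node end."""
    totals = {0}
    for weights in edges[start:end]:
        totals = {t + w for t in totals for w in weights}
    return totals


PROTEIN = "ARKTDAR"
# Variable PTMs by 0-based residue index, following the caption.
SITE_PTMS = {
    1: ["acetyl", "methyl"],              # first R
    2: ["acetyl", "methyl", "dimethyl"],  # K
    3: ["phospho"],                       # T
    6: ["methyl"],                        # second R
}
GRAPH = build_mass_graph(PROTEIN, SITE_PTMS)

# Acetyl-R (198) + methyl-K (142) + phospho-T (181) = 521, the x1 -> x4 path.
print(521 in path_masses(GRAPH, 1, 4))
# Adding D (115) and A (71) gives 707, the x1 -> x6 path.
print(707 in path_masses(GRAPH, 1, 6))
```

Under these assumed masses, the path weights 521 and 707 match two of the peaks in the prefix residue mass spectrum from the caption, which is the consistency the figure illustrates. Enumerating all path weights this way grows quickly with the number of PTM sites, so it is only suitable for toy inputs like this one.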
