Gigascience. 2019 Jan 1;8(1):giy150.
doi: 10.1093/gigascience/giy150.

Annotation of the Giardia proteome through structure-based homology and machine learning

Brendan R E Ansell et al. Gigascience. 2019.

Abstract

Background: Large-scale computational prediction of protein structures represents a cost-effective alternative to empirical structure determination with particular promise for non-model organisms and neglected pathogens. Conventional sequence-based tools are insufficient to annotate the genomes of such divergent biological systems. Conversely, protein structure tolerates substantial variation in primary amino acid sequence and is thus a robust indicator of biochemical function. Structural proteomics is poised to become a standard part of pathogen genomics research; however, informatic methods are now required to assign confidence in large volumes of predicted structures.

Aims: Our aim was to predict protein structures across the proteome of a neglected human pathogen, Giardia duodenalis, and to stratify the predicted structures into high- and lower-confidence categories using a variety of metrics, alone and in combination.

Methods: We used the I-TASSER suite to predict structural models for ∼5,000 proteins encoded in G. duodenalis and to identify their closest empirically determined structural homologues in the Protein Data Bank. Models were assigned to high- or lower-confidence categories depending on the presence of matching protein family (Pfam) domains in query and reference peptides. Metrics output from the suite and derived metrics were assessed for their ability to predict the high-confidence category individually, and in combination through development of a random forest classifier.
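The labelling rule described in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the model names, and the accessions assigned to each model are hypothetical, and the only rule taken from the text is that at least one shared Pfam code between query and reference marks a model as high-confidence.

```python
from typing import Dict, Iterable, Tuple


def pfam_match(query_pfam: Iterable[str], reference_pfam: Iterable[str]) -> bool:
    """Return True if query and reference share at least one Pfam code."""
    return bool(set(query_pfam) & set(reference_pfam))


def confidence_label(query_pfam: Iterable[str], reference_pfam: Iterable[str]) -> str:
    """Map the Pfam match onto the two categories named in the abstract."""
    if pfam_match(query_pfam, reference_pfam):
        return "high-confidence"
    return "lower-confidence"


# Hypothetical (query codes, reference codes) pairs, for illustration only.
pairs: Dict[str, Tuple[list, list]] = {
    "model_A": (["PF00069"], ["PF00069", "PF07714"]),  # one shared code
    "model_B": (["PF00072"], ["PF00486"]),             # no shared code
    "model_C": ([], ["PF00070"]),                      # query has no Pfam code
}

labels = {name: confidence_label(q, r) for name, (q, r) in pairs.items()}
for name, label in labels.items():
    print(name, label)
```

Under this rule, a query with no Pfam annotation can never be labelled high-confidence, which is why a classifier trained on the other metrics is useful for such models.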

Results: We identified 1,095 high-confidence models including 212 hypothetical proteins. Amino acid identity between query and reference peptides was the greatest individual predictor of high-confidence status; however, the random forest classifier outperformed any metric in isolation (area under the receiver operating characteristic curve = 0.976) and identified a subset of 305 high-confidence-like models, corresponding to false-positive predictions. High-confidence models exhibited greater transcriptional abundance, and the classifier generalized across species, indicating the broad utility of this approach for automatically stratifying predicted structures. Additional structure-based clustering was used to cross-check confidence predictions in an expanded family of Nek kinases. Several high-confidence-like proteins yielded substantial new insight into mechanisms of redox balance in G. duodenalis, a system central to the efficacy of limited anti-giardial drugs.
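The area under the receiver operating characteristic curve (AUC) reported above has a simple interpretation: it is the probability that a randomly chosen high-confidence model scores higher on the metric than a randomly chosen lower-confidence model. The sketch below computes it that way for a single metric. The function name and the toy values are assumptions for illustration; they are not data from the study, and the authors' own tooling is not specified in the abstract.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical query-reference amino acid identities and Pfam-derived labels.
identity = [0.62, 0.48, 0.35, 0.30, 0.22, 0.15]
is_high_confidence = [True, True, False, True, False, False]

auc = roc_auc(identity, is_high_confidence)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to chance prediction and 1.0 to perfect separation; the pairwise form used here is quadratic in the number of models, which is acceptable for a sketch but would normally be replaced by a rank-based calculation for thousands of models.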

Conclusion: Structural proteomics combined with machine learning can aid genome annotation for genetically divergent organisms, including human pathogens, and stratify predicted structures to promote efficient allocation of limited resources for experimental investigation.
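The stratification described in the abstract combines two pieces of information per model: the Pfam-derived label and the classifier's prediction. A minimal sketch of that cross-tabulation follows. The abstract states that high-confidence-like models correspond to false-positive predictions; treating "lower-confidence-like" (a category named in the Figure 4 caption) as the false-negative case is an assumption made here by symmetry, and the function name and example inputs are hypothetical.

```python
from collections import Counter


def stratify(has_pfam_match: bool, predicted_high: bool) -> str:
    """Combine the Pfam label with the classifier call into four categories."""
    if has_pfam_match and predicted_high:
        return "high-confidence"        # true positive
    if not has_pfam_match and predicted_high:
        return "high-confidence-like"   # false positive, per the abstract
    if has_pfam_match and not predicted_high:
        return "lower-confidence-like"  # false negative (assumed by symmetry)
    return "lower-confidence"           # true negative


# Hypothetical (Pfam match, classifier prediction) pairs for five models.
models = [(True, True), (False, True), (True, False), (False, False), (False, True)]

counts = Counter(stratify(match, pred) for match, pred in models)
for category, n in counts.items():
    print(category, n)
```

In this framing, the high-confidence-like group is the one of most practical interest: these models lack a confirming Pfam match yet resemble confirmed models on every other metric.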

Figures

Figure 1: Pfam code agreement as a proxy for predicted protein structure quality. A query peptide sequence is submitted to I-TASSER software to predict its 3D structure (colored blue). Metrics describing the predicted structure (“model”) are extracted for downstream analysis. The model is compared with empirically determined protein crystal structures available in the PDB using TM-align, from which the closest structural homologue is identified (“reference”; colored red). Metrics describing the alignment are also extracted. Pfam codes are assigned to the primary peptide sequences that constitute the model and reference structures using InterProScan software (lower right side). The presence of at least one matching Pfam code assigned to the query and reference peptides (“PFAM match”) indicates greater likelihood of structural similarity between the model and the reference. Models with this feature are assigned as “high-confidence.” The ability of each extracted metric (“Feature”) to predict the high-confidence category (“Factor”) is assessed, and then a random forest (RF) classifier is trained to identify the factor using all available features.
Figure 2: Structure prediction and homology searching elaborate putative functions for query peptides. (A) Intersection of predicted structures for which Pfam codes were available via query or reference peptides. The majority of structures predicted from BLAST-annotated peptides (blue vertical bars) had at least one Pfam annotation that matched with the reference structure. The majority of peptides that lacked BLAST annotation (i.e., “hypothetical proteins”; black vertical bars) also lacked Pfam codes. A total of 824 proteins (792 hypothetical) for which no Pfam codes were annotated in the query or the reference are not displayed. (B) Differential abundance of Pfam codes assigned to query and reference peptides for 1,095 high-confidence pairs. (C) Number of unique Pfam codes available for query (orange) and reference (teal) peptides for 1,095 high-confidence pairs. The right-shifted distribution in reference-derived Pfam codes indicates an overall increase in annotation via this method.
Figure 3: A random forest classifier correctly identifies the majority of high-confidence models using I-TASSER software output and derived metrics. (A) Relative importance of 12 metrics used to predict the presence of matching Pfam terms between query peptides and reference peptides identified via structural homology searching. (B) Receiver operating characteristic curves for the best-performing individual metrics (AUC ≥0.7; Table 1) and the random forest classifier (“Exact_match_prediction”). The unbroken x = y line represents chance prediction.
Figure 4: Distribution of I-TASSER software output and derived metrics across high-confidence, high-confidence-like, lower-confidence, and lower-confidence-like models. The random forest classifier's prediction of confidence status (“Exact_match_prediction”) is outlined in black.
Figure 5: Computationally predicted structures for putative ferredoxin:NAD(P)H reductases (FNRs). The high-confidence-like structure predicted for GL_87577 is similar to the predicted C-terminus of an Entamoeba histolytica protein previously annotated as glutamate synthase (EhNO1) [27]. EhNO1 exhibits FNR activity and, unlike bacterial enzymes such as the Thermotoga maritima FNR (PDB code: 4YLF), does not require an alpha subunit. Tm FNR beta subunit: purple; alpha subunit: blue; FMN co-factor: green.

References

    1. Karplus K, Barrett C, Hughey R. Hidden Markov models for detecting remote protein homologies. Bioinformatics. 1998;14:846–56.
    2. Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
    3. Rost B. Twilight zone of protein sequence alignments. Protein Eng. 1999;12:85–94.
    4. Illergård K, Ardell DH, Elofsson A. Structure is three to ten times more conserved than sequence--a study of structural response in protein cores. Proteins. 2009;77:499–508.
    5. Morrison HG, McArthur AG, Gillin FD, et al. Genomic minimalism in the early diverging intestinal parasite Giardia lamblia. Science. 2007;317:1921–6.
