PLoS One. 2009 Oct 21;4(10):e7546.
doi: 10.1371/journal.pone.0007546.

A statistical model of protein sequence similarity and function similarity reveals overly-specific function predictions

Brenton Louie et al. PLoS One. 2009.

Abstract

Background: Predicting protein function from primary sequence is an important open problem in modern biology. Many thousands of proteins still have no known function, and current approaches for predicting function need improvement. One problem in particular is overly-specific function predictions, which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity.

Methodology: Our statistical model is based on sets of proteins with experimentally validated functions and numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, facilitating prediction of function with an appropriate level of specificity.
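
The idea of predicting function similarity from a BLAST bit score can be sketched with a small generalized linear model. The sketch below is only illustrative: the abstract does not state the model family or link function, so the binomial family with a logit link, the function names, and the data points are all assumptions made here, and the response is assumed to be a similarity score between 0 and 1.

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def fit_logit_glm(xs, ys, iterations=25):
    """Fit y ~ sigmoid(b0 + b1*x) by iteratively reweighted least squares.

    Assumption: logit link with binomial variance; the source does not
    say which GLM family or link the authors used.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        etas = [b0 + b1 * x for x in xs]
        # Clamp fitted means away from 0 and 1 so the weights stay positive.
        mus = [min(max(sigmoid(e), 1e-9), 1 - 1e-9) for e in etas]
        ws = [m * (1 - m) for m in mus]
        # Working response of the IRLS step.
        zs = [e + (y - m) / w for e, y, m, w in zip(etas, ys, mus, ws)]
        # Weighted least squares of zs on xs (closed form for one predictor).
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        mz = sum(w * z for w, z in zip(ws, zs)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxz = sum(w * (x - mx) * (z - mz) for w, x, z in zip(ws, xs, zs))
        b1 = sxz / sxx
        b0 = mz - b1 * mx
    return b0, b1

def predict(b0: float, b1: float, log_bit_score: float) -> float:
    """Predicted function similarity for a given log bit score."""
    return sigmoid(b0 + b1 * log_bit_score)

# Hypothetical (log bit score, function similarity) pairs, invented for illustration.
log_bit_scores = [3.5, 4.0, 4.5, 5.0, 5.5, 6.0]
similarities = [0.20, 0.35, 0.55, 0.75, 0.90, 0.97]

b0, b1 = fit_logit_glm(log_bit_scores, similarities)
for x in (4.0, 5.0, 6.0):
    print(f"log bit score {x:.1f}: predicted similarity {predict(b0, b1, x):.2f}")
```

Because the fitted curve is continuous in the bit score, a prediction can be read off at any level of sequence similarity rather than only at fixed cutoffs.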

Significance: Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value <1e−62, non-redundant NCBI protein database (NRDB)) and only a small likelihood of a specific function match for proteins with low sequence similarity (bit score <54.6, e-value >1e−05, NRDB). For sequence similarity ranges in between, our annotation model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function and predicted functions for thousands of these proteins, ranging from general to very specific. We also applied the model to a data set of proteins whose specific functions had previously been assigned by electronic annotation. We show that, on average, these prior function predictions are more specific (quite possibly overly-specific) than predictions from our model, which is based on proteins with experimentally determined function.
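
The two bit score cut points quoted above split alignments into three regimes. A minimal sketch of that split follows; the function name and the regime labels are hypothetical, and only the two numeric thresholds come from the abstract.

```python
HIGH_BIT_SCORE = 244.7  # above this, the abstract reports nearly exact function similarity
LOW_BIT_SCORE = 54.6    # below this, only a small likelihood of a specific function match

def similarity_regime(bit_score: float) -> str:
    """Return a hypothetical label for the regime a BLAST bit score falls in."""
    if bit_score > HIGH_BIT_SCORE:
        return "high"
    if bit_score < LOW_BIT_SCORE:
        return "low"
    # In between, function similarity rises with bit score but varies considerably.
    return "intermediate"

for score in (30.0, 120.0, 500.0):
    print(score, similarity_regime(score))
```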

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. BLAST bit score (log) vs. GO term depth.
The trend shown is a lowess line. GO level generally increases with bit score; however, it varies widely over all ranges of bit scores, even for log bit scores above 6.0, which indicate a high degree of sequence similarity.
Figure 2. BLAST bit score (log) versus IC.
The trend shown is a lowess line. The IC of GO terms generally increases with bit score. IC is less variable than GO level across most bit score ranges; however, considerable variability remains even above a log bit score of 6.0.
Figure 3. BLAST bit score (log) versus RIC.
The trend shown is a lowess line. The RIC statistic normalizes the variability of IC values in the training data. RIC is the least variable statistic across most bit score ranges, especially for log bit scores above 6.0, where all RIC values are 1.0 (no variability).
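
The lowess trend lines in Figures 1 to 3 are locally weighted regression fits. The sketch below shows the core idea in plain Python under stated assumptions: it does a single pass of tricube-weighted local linear regression and omits the robustness iterations of the full lowess procedure, and the function names, the `frac` default, and the sample points are all made up for illustration rather than taken from the paper.

```python
import math

def tricube(u: float) -> float:
    """Tricube kernel: weight 1 at distance 0, falling to 0 at distance 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def lowess(xs, ys, frac=0.5):
    """Smooth ys against xs with one pass of local linear regression.

    frac is the fraction of points used in each local fit.
    """
    n = len(xs)
    k = min(n, max(2, math.ceil(frac * n)))
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest neighbour (x0 included).
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
        ws = [tricube((x - x0) / h) for x in xs]
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Value of the local weighted line at x0.
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical (log bit score, score) pairs, invented for illustration.
xs = [3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5]
ys = [0.10, 0.30, 0.25, 0.55, 0.60, 0.85, 1.00, 1.00]
for x, y_hat in zip(xs, lowess(xs, ys)):
    print(f"{x:.1f} -> {y_hat:.3f}")
```
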
Figure 4. Correlation between log-transformed BLAST statistics.
BLAST statistics are generally highly correlated with each other. If two variables are highly correlated, the information they provide about a response variable (here, RIC) is not independent. In that case, generally only one of the variables adds significant predictive power to a statistical model.
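
One reason BLAST statistics move together is that the e-value is computed from the bit score: under the standard Karlin-Altschul relation, E = m·n·2^(−S′), so for a fixed search space the log e-value is a linear function of the bit score S′. The sketch below illustrates this with a Pearson correlation. The search space size and the bit scores are hypothetical values chosen for illustration, and the caption does not say which statistics the authors actually compared.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

SEARCH_SPACE = 1e11  # hypothetical m*n (query length times database length)
bit_scores = [40.0, 60.0, 90.0, 150.0, 250.0, 400.0]  # invented for illustration

log_bit_scores = [math.log(s) for s in bit_scores]
# ln E = ln(m*n) - S' * ln 2, computed in log space to avoid underflow.
log_evalues = [math.log(SEARCH_SPACE) - s * math.log(2) for s in bit_scores]

print("bit score vs log e-value:    ", pearson(bit_scores, log_evalues))
print("log bit score vs log e-value:", pearson(log_bit_scores, log_evalues))
```

With a predictor this tightly tied to the bit score, adding the e-value to a model that already contains the bit score contributes little independent information.
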
Figure 5. GLM and GAM model fits on the Training, Test, and Combined data sets.
Functional similarity (RIC) between two proteins generally increases with sequence similarity, as measured by bit score. The RIC predictions for the GLM and GAM model fits to the Test data (solid and dashed green lines) are somewhat higher than those for the GLM and GAM model fits to the Training data (solid and dashed black lines), indicating some bias in the data sets. GLM and GAM model fits using a combined data set (training + test) may be more general for prediction (solid and dashed blue lines). There is no significant difference between the GAM and GLM model fits on any data set.
Figure 6. GLM model fits from BLAST alignments generated from proteins with experimental functions only and from proteins with electronic annotations.
The GLM models fit on data containing only experimental annotations (solid lines) predict a lower RIC for most ranges of bit scores than the models fit using electronic annotations (dashed lines), for both the Training (black lines) and Test (green lines) data sets. The difference in predicted RIC is greatest for (log) bit score ranges of about 4.0 to 5.0 (bit scores 54.6 to 148.4, e-values 1e−05 to 1e−34, NRDB).
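
The paired values in the Figure 6 caption (4.0 and 54.6, 5.0 and 148.4) are consistent with a natural-log transform of the bit score. The short sketch below inverts that transform; the function name is hypothetical, and reading the axis as a natural log is an inference from those numbers rather than something the captions state outright.

```python
import math

def bit_score_from_log(log_bit_score: float) -> float:
    """Invert a natural-log transform of a BLAST bit score."""
    return math.exp(log_bit_score)

# 4.0 and 5.0 are the cut points named in the Figure 6 caption; 5.5 and 6.0
# are added to relate the other thresholds in the text to the same scale.
for log_score in (4.0, 5.0, 5.5, 6.0):
    print(f"log bit score {log_score:.1f} -> bit score {bit_score_from_log(log_score):.1f}")
```

On this reading, a log bit score of 5.5 corresponds to the bit score of 244.7 quoted in the abstract, and the 6.0 mark in Figures 1 to 3 corresponds to a bit score of about 403.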
