J Virol. 2009 Oct;83(20):10719-36.
doi: 10.1128/JVI.00595-09. Epub 2009 Jul 29.

Overlapping genes produce proteins with unusual sequence properties and offer insight into de novo protein creation

Corinne Rancurel et al. J Virol. 2009 Oct.

Abstract

It is widely assumed that new proteins are created by duplication, fusion, or fission of existing coding sequences. Another mechanism of protein birth is provided by overlapping genes. They are created de novo by mutations within a coding sequence that lead to the expression of a novel protein in another reading frame, a process called "overprinting." To investigate this mechanism, we have analyzed the sequences of the protein products of manually curated overlapping genes from 43 genera of unspliced RNA viruses infecting eukaryotes. Overlapping proteins have a sequence composition globally biased toward disorder-promoting amino acids and are predicted to contain significantly more structural disorder than nonoverlapping proteins. By analyzing the phylogenetic distribution of overlapping proteins, we were able to confirm that 17 of these had been created de novo and to study them individually. Most proteins created de novo are orphans (i.e., restricted to one species or genus). Almost all are accessory proteins that play a role in viral pathogenicity or spread, rather than proteins central to viral replication or structure. Most proteins created de novo are predicted to be fully disordered and have a highly unusual sequence composition. This suggests that some viral overlapping reading frames encoding hypothetical proteins with highly biased composition, often discarded as noncoding, might in fact encode proteins. Some proteins created de novo are predicted to be ordered, however, and whenever a three-dimensional structure of such a protein has been solved, it corresponds to a fold previously unobserved, suggesting that the study of these proteins could enhance our knowledge of protein space.
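The overprinting mechanism described above can be illustrated with a toy example. The sketch below is not taken from the paper: the DNA sequence, the mutation, and the names `translate`, `protein_x`, and `protein_y` are hypothetical, and the standard genetic code is assumed. It shows a single point mutation that is synonymous in one reading frame while abolishing a stop codon in the +1 frame, so that the protein in the second frame is elongated to the next preexisting stop codon.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str, start: int) -> str:
    """Translate dna from index `start`, stopping at the first stop codon."""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical sequence: protein Y is read in frame 0 (start at index 0),
# protein X in frame +1 (start codon at index 4).
dna = "ATGGATGCCACTGTAGCTGGAAAGCTGACCTAA"
protein_y = translate(dna, 0)   # "MDATVAGKLT"
protein_x = translate(dna, 4)   # "MPL", ends at the TAG stop at index 13

# Point mutation A -> C at index 14: GTA -> GTC in frame 0 (both valine),
# but TAG -> TCG in frame +1, which removes the stop codon of protein X.
mutant = dna[:14] + "C" + dna[15:]
mutant_y = translate(mutant, 0)  # unchanged: "MDATVAGKLT"
mutant_x = translate(mutant, 4)  # elongated: "MPLSLES"

print(protein_y, protein_x)
print(mutant_y, mutant_x)
```

In this toy case the four extra residues of the elongated protein correspond to a novel region that overlaps the existing coding sequence, which is the situation the abstract calls a gene overlap created by overprinting.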

Figures

FIG. 1.
Creation of a novel protein region (C-terminal extension) by overprinting. Top, a DNA sequence encodes two proteins in different reading frames. Notice the potential, unused stop codon downstream of protein X. Middle, a mutation abolishes the stop codon of protein X, causing its elongation (“overprinting”) to the preexisting stop codon. This results in a gene overlap. Bottom, the overlap encodes an overprinted (ancestral) protein region (dark gray) and an overprinting (novel) protein region (light gray).
FIG. 2.
Structural and functional prediction workflow, showing the Betatetravirus replicase/capsid overlap. Conventions are the same as in Fig. 1. Second panel, superimposed PONDR prediction for the capsid (dark gray) and replicase (light gray). Regions with a score above 0.5 are predicted to be disordered. Third panel, predictions of the boundaries of ancestral and novel regions of the replicase and capsid (see text). Bottom, result of refined structural and functional analysis (see text). Wide and narrow boxes correspond, respectively, to predicted order and disorder. Domain names were obtained from the literature. Note the good agreement between automated PONDR predictions and the refined analysis.
FIG. 3.
Predicted disorder content of proteins encoded by overlapping genes. The prediction was made using PONDR VSL2. The error bars correspond to a 95% confidence interval.
FIG. 4.
Structural and functional organization of recognizable ancestral/novel overlapping protein regions. Proteins encoded by overlapping genes are represented to scale with the same conventions as in Fig. 1 and 2. Boundaries of ancestral and novel regions are given in Table 7. Each panel represents a different case of overprinting. For instance, panel 3 represents all novel proteins that have overprinted homologous capsid proteins. The name of the panel refers to the PFAM family (in parentheses) or clan (in brackets), actual or proposed herein, to which ancestral protein regions belong (see text and Table 7). Ancestral regions within a given clan are aligned vertically (e.g., the 30K domain of Umbra-, Tombus-, and Capillovirus movement proteins, in panel 4). Note that domains bearing a similar name are not always homologous. For instance, in panel 2 the Pomovirus and Potexvirus TGBp2 proteins are homologous (they belong to the family Plant_vir_prot), whereas the Pomovirus and Potexvirus TGBp3 proteins are not (they belong, respectively, to the β C/D and 7K families) (Table 7). Likewise, there is no evidence that the RNA-binding “arms” of capsid proteins of different genera are homologous (panel 3). Abbreviations: 30K, conserved domain of the 30K family of movement proteins; al, antigenic loop; B (or B1 or B2), base domain (or subdomain); Flexi coat, central conserved region of flexuous viral coats; Ig, immunoglobulin-like domain; L, large envelope protein; LDM, long-distance movement protein; NABP, nucleic acid-binding protein; Prol-rich, proline-rich region; RNP, ribonucleoprotein; Rdrp, RNA-dependent RNA polymerase; RT, reverse transcriptase; S (or S1 or S2), shell domain (or subdomain); tm, transmembrane segment; TGBp2 and TGBp3, triple gene block proteins 2 and 3; TP, terminal protein.
FIG. 5.
REs of overlapping or nonoverlapping protein regions versus Swiss-Prot. The RE of two data sets is a rough measure of their difference in mean amino acid composition (see text). We have plotted, from left to right, the REs of biologically meaningful data sets (PDB and Disprot) with respect to Swiss-Prot; the RE of nonoverlapping regions (representative of viral proteins) with respect to Swiss-Prot; and the REs with respect to Swiss-Prot of either all overlapping regions, ancestral regions, or novel regions. Note that ancestral and novel regions form only a subset of all overlapping regions, since for some pairs of overlapping regions we could not determine which was the ancestral one and which was the novel one.
FIG. 6.
Deviation in sequence composition of overlapping protein regions relative to the background composition of nonoverlapping regions. Relative enrichment (positive values) or depletion (negative values) in amino acids of each data set with respect to that of nonoverlapping regions is shown (see text). For easier visualization, we have plotted values only for the amino acids that show a statistically significant bias (P < 0.01). Amino acids are arranged according to their level of codon degeneracy, indicated below the lower panel (a codon degeneracy of 3 for isoleucine [I] means that three codons code for isoleucine). The dashed vertical lines separate amino acids with a high codon degeneracy (≥4) from those with a low degeneracy (≤3). Note that the data sets of novel and ancestral regions (2,280 aa each) represent only 22% of the amino acids contained in “all overlapping regions”. Thus, the composition of all overlapping regions is not expected to correspond exactly to the mean composition of the ancestral and novel subsets.
FIG. 7.
Evolutionary constraints of overlapping protein regions and their disorder content. Predicted disorder content is plotted for overlapping protein pairs from several viruses, listed below the graph. In each pair, the first protein listed is the more constrained. Bars indicate the percentage of disorder in the overlapping parts of these proteins. Abbreviations: HBV, hepatitis B virus; CLCuV, cotton leaf curl virus; SIV, simian immunodeficiency virus; HTLV, human T-lymphotropic virus; φX174, coliphage φX174; PLRV, potato leafroll virus; HPV, human papillomavirus.
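
The composition comparisons summarized in the captions of Fig. 5 and 6 can be sketched in a few lines. The captions refer to the main text for the exact definitions, so everything below is an assumption: RE is taken to be the relative entropy (Kullback-Leibler divergence) between two mean amino acid compositions, relative enrichment is taken to be the fractional difference from the background frequency, a pseudocount is added to avoid zero frequencies, and the sequences and function names are invented for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences, pseudocount=1.0):
    """Mean amino acid frequencies of a data set (pseudocount is an assumption)."""
    counts = Counter()
    for seq in sequences:
        counts.update(aa for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def relative_entropy(p, q):
    """Kullback-Leibler divergence of composition p from composition q, in bits."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def relative_enrichment(p, background):
    """Fractional enrichment (> 0) or depletion (< 0) of each amino acid."""
    return {aa: (p[aa] - background[aa]) / background[aa] for aa in AMINO_ACIDS}


# Invented toy data sets standing in for nonoverlapping and overlapping regions.
background = composition(["MKVLAAGIVFLDEWYNQRSTCHP", "MLLAVIFGWYKDE"])
overlapping = composition(["MSPRRSSPGQKESPRSPQ"])

re_bits = relative_entropy(overlapping, background)
enrichment = relative_enrichment(overlapping, background)

print(f"relative entropy: {re_bits:.3f} bits")
for aa in sorted(AMINO_ACIDS, key=lambda a: enrichment[a], reverse=True)[:3]:
    print(aa, f"{enrichment[aa]:+.2f}")
```

A relative entropy of zero means the two data sets have identical mean composition; larger values indicate a stronger compositional bias. The significance filtering mentioned for Fig. 6 (P < 0.01) is not reproduced here.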
