PLoS One. 2013;8(1):e54210.
doi: 10.1371/journal.pone.0054210. Epub 2013 Jan 22.

Finding protein-coding genes through human polymorphisms

Edward Wijaya et al. PLoS One. 2013.

Abstract

Human gene catalogs are fundamental to the study of human biology and medicine. But they are all based on open reading frames (ORFs) in a reference genome sequence (with allowance for introns). Individual genomes, however, are polymorphic: their sequences are not identical. There has been much research on how polymorphism affects previously identified genes, but no research has been done on how it affects gene identification itself. We computationally predict protein-coding genes in a straightforward manner, by finding long ORFs in mRNA sequences aligned to the reference genome. We systematically test the effect of known polymorphisms on this procedure. Polymorphisms can not only disrupt ORFs but also create long ORFs that do not exist in the reference sequence. We found 5,737 putative protein-coding genes that do not exist in the reference, whose protein-coding status is supported by homology to known proteins. On average, 10% of these genes are located in genomic regions devoid of annotated genes in 12 other catalogs. Our statistical analysis showed that these ORFs are unlikely to occur by chance.
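
The idea of "finding long ORFs in mRNA sequences" can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `longest_orf`, the ATG-to-stop definition of an ORF, and the restriction to the three forward reading frames are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(mrna):
    """Return (start, end, length) of the longest ATG..stop ORF.

    Coordinates are 0-based and half-open; the stop codon is included.
    Only the three forward reading frames are scanned (an assumption).
    """
    seq = mrna.upper().replace("U", "T")
    best = (0, 0, 0)
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start
                if length > best[2]:
                    best = (start, i + 3, length)
                start = None
    return best

# Toy sequence (hypothetical): the only ORF is ATG AAA TAG in frame 2.
print(longest_orf("CCATGAAATAGGG"))
```

Running the same function on a reference mRNA and on a polymorphism-modified copy would show whether the longest ORF changes.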

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. The standard model of genetic information transfer in molecular biology.
Panel (a) shows that the transfer begins with DNA being transcribed into mRNA and continues with protein being synthesized using the information in the mRNA as a template (translation). We investigate the effect of polymorphic modification of the mRNA. Panel (b) depicts how a new, longer ORF is formed. The new ORF starts upstream of the original ORF in the mRNA. The new ORF may or may not overlap the original ORF; if it does overlap, it is in a different reading frame, so the proteins are completely different.
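
The relationship between an original and a new ORF described in panel (b) can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function name `compare_orfs` and the 0-based, half-open coordinate convention are assumptions, and the coordinates in the usage line are made up.

```python
def compare_orfs(original, new):
    """Compare two ORFs given as (start, end) half-open mRNA coordinates."""
    o_start, o_end = original
    n_start, n_end = new
    # Number of bases shared by the two intervals (0 if they are disjoint).
    overlap = max(0, min(o_end, n_end) - max(o_start, n_start))
    return {
        "upstream": n_start < o_start,            # new ORF starts earlier
        "overlap_bp": overlap,
        "same_frame": o_start % 3 == n_start % 3,  # equal start offsets mod 3
    }

# Hypothetical coordinates: the new ORF starts upstream, overlaps the
# original, and sits in a different reading frame.
print(compare_orfs((100, 400), (20, 353)))
```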
Figure 2. Workflows for finding protein-coding genes.
Panel (a) describes the workflow of gene-finding without applying human polymorphism and (b) with human polymorphism. The values inside the brackets refer to the numbers of mRNAs, ORFs and genes, respectively. The final number of genes in workflow (b) counts the genes whose ORFs change after modification; in workflow (a) no such change applies. For workflow (b), two main sources of data are used: human mRNA sequences and polymorphism data (dbSNP 131). Based on the polymorphism information we redefine the mRNA sequences. From the modified mRNA sequences we derive the longest ORFs. These ORFs are further refined by filtering them based on significant homology to Swiss-Prot and proximity to the 5′ UTR. Finally we construct the genes from the refined ORFs.
Figure 3. Modification by polymorphism of mRNA AK124706 and its ORFs.
In the reference genome the modification is caused by an insertion (rs66651466) with ‘AT’ as the allele. The initial longest ORF before modification has length 302 bp. The new longest ORF has length 336 bp, and it aligns to the Swiss-Prot Integrin beta-5 protein (Acc:P18084). Annotations of the start/stop codons in the translation process and of the alleles that cause the change can be found in Figure 2.
Figure 4. Modification by polymorphism of mRNA AK127273 and its ORFs.
The initial longest ORF before modification has length 546 bp. The longest ORF after modification has length 594 bp. The polymorphism responsible for the modification is an in-del (rs71162510) which replaces the reference genome allele ‘C’ with ‘TGCCCC’.
Figure 5. Modification by polymorphism of mRNA AY129028 and its ORFs.
The initial longest ORF before modification has length 309 bp, and after has length 357 bp. The polymorphism that effects the modification is a SNP (rs8011546) which replaces the reference allele ‘G’ with ‘A’.
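
The two kinds of modification in Figures 4 and 5 (an in-del replacing ‘C’ with ‘TGCCCC’, and a SNP replacing ‘G’ with ‘A’) can be sketched as simple string edits. This is a hedged illustration, not the authors' code: `apply_variant` and `shifts_frame` are hypothetical helpers, positions are assumed to be 0-based mRNA offsets, and the toy sequence is invented.

```python
def apply_variant(seq, pos, ref, alt):
    """Replace the reference allele `ref` at 0-based `pos` with `alt`."""
    if seq[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return seq[:pos] + alt + seq[pos + len(ref):]

def shifts_frame(ref, alt):
    """True if the net length change is not a multiple of 3."""
    return (len(alt) - len(ref)) % 3 != 0

# Toy sequence (hypothetical), using the allele pairs quoted in the captions.
modified = apply_variant("AAACGG", 3, "C", "TGCCCC")
print(modified)                       # in-del applied to the toy sequence
print(shifts_frame("C", "TGCCCC"))    # net +5 bases changes the downstream frame
print(shifts_frame("G", "A"))         # a SNP keeps the frame
```

A SNP leaves the reading frame intact, so it can only change an ORF by creating or removing a start or stop codon, whereas an in-del whose net length is not a multiple of three also shifts everything downstream.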
Figure 6. Cumulative allele frequency from 11 populations.
In panel (a) we plot the allele percentage of new ORFs and in panel (b) the allele percentage of all HapMap data in the UCSC Genome Browser. The percentage axis in panel (b) is based on Allele1, chosen arbitrarily.
