J Proteome Res. 2019 Jun 7;18(6):2433-2445.
doi: 10.1021/acs.jproteome.8b00935. Epub 2019 May 8.

Proteogenomic Annotation of Chinese Hamsters Reveals Extensive Novel Translation Events and Endogenous Retroviral Elements

Shangzhong Li et al. J Proteome Res.

Abstract

A high-quality genome annotation greatly facilitates successful cell line engineering. Standard draft genome annotation pipelines are based largely on de novo gene prediction, homology, and RNA-Seq data. However, draft annotations can suffer from incorrect predictions of translated sequence, inaccurate splice isoforms, and missing genes. Here, we generated a draft annotation for the newly assembled Chinese hamster genome and used RNA-Seq, proteomics, and Ribo-Seq to experimentally annotate the genome. We identified 3529 new proteins compared to the hamster RefSeq protein annotation and 2256 novel translational events (e.g., alternative splices, mutations, and novel splices). Finally, we used this pipeline to identify the source of translated retroviruses contaminating recombinant products from Chinese hamster ovary (CHO) cell lines, including 119 type-C retroviruses, thus enabling future efforts to eliminate retroviruses to reduce the costs incurred with retroviral particle clearance. In summary, the improved annotation provides a more accurate resource for CHO cell line engineering by facilitating the interpretation of omics data, the definition of cellular pathways, and the engineering of complex phenotypes.

Keywords: Chinese hamster; endogenous retrovirus; genome annotation; proteogenomics.

Figures

Figure 1. Overview of the proteogenomic pipeline.
Multiple databases of putative protein sequences were generated based on the newly assembled hamster genome and additional data. The KnownDB contains protein sequences from our draft annotation generated here. The SNP/SpliceDB was derived from RNA-Seq samples and contains candidate mutated or novel spliced proteins compared to the draft annotation. The RiboDB was derived from predicted translated ORFs from Ribo-Seq and RNA-Seq. The SixFrameDB was derived from the reference genome. After database construction, mass spectra were mapped against the protein databases using MSGF+ to identify the peptides. The peptides were then mapped back to the genome and compared with the draft annotation to verify translated known proteins, enumerate novel translation events, and identify retroviral proteins.
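To illustrate what a six-frame database such as the SixFrameDB contains, here is a minimal sketch of generic six-frame translation. It is not the authors' implementation: the caption only says that the SixFrameDB is derived from the reference genome. The function names, the use of the standard genetic code, and the `X` placeholder for ambiguous codons are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the standard nuclear code applies; the source does not specify a table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map (strand, offset) to the translation of that reading frame."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(s[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
print(frames[("+", 0)])  # forward strand, frame 0
```

In practice each frame would typically be split at stop codons (`*`) and short fragments discarded before the sequences are searched against mass spectra; those filtering choices are not described in this caption.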
Figure 2. Number of novel draft proteins verified by draft-only peptides in different categories.
The draft annotation predicted thousands of novel protein sequences. (A) Of these, 3,389 were supported by peptides that map uniquely to the novel protein sequences. (B) Only 140 lacked additional support from uniquely mapping peptides, while thousands had peptide support. RefSeq perfect short: RefSeq proteins map perfectly but are shorter than draft proteins; High quality: high-quality mapping between draft and RefSeq proteins; Draft high quality: draft proteins map to RefSeq with high quality, but the reverse does not hold; RefSeq high quality: RefSeq proteins map to draft with high quality, but the reverse does not hold; Low quality: low-quality mapping between draft and RefSeq.
Figure 3. Proteogenomics and RiboTaper verified predicted protein sequences and identified novel translation events.
(A) Numerous novel translational events were identified, including novel splice sites that are not in the draft annotation file (new splice), non-synonymous mutations (SNP), peptides that map to UTR regions or to transcripts with no CDS (new CDS), alternative splice sites (alter splice), peptides mapping to the reverse strand of a reference CDS (reverse), insertions (INS), peptides mapping to intergenic regions (new gene), deletions (DEL), and gene fusions connecting two genes (fusion). (B) Statistics for the number of spectra, peptides, and protein isoforms identified in proteogenomics. (C) Number of ORFs identified using RiboTaper. Outer circle: number of transcripts predicted with a single ORF (blue) or multiple ORFs (orange). Inner circle: number of transcripts with (darker blue and orange) or without (light blue and orange) peptide support. (D) Number of proteins that are shorter or longer than in the draft annotation. Positive x-axis values mean that the RiboTaper proteins are shorter (i.e., start later) than the draft annotation.
Figure 4. Hundreds of SNPs in hamster and different CHO cell lineages are validated.
A comparison of the (A) distribution of SNP types identified from RNA-Seq and (B) SNP types verified by proteomics validates the overall distribution of SNPs. (C) Peptide-validated non-synonymous SNPs are located throughout the protein bodies. The length of each protein is scaled to 1, and 0 represents the start codon. SNPs located below 0 or above 1 represent peptide-supported SNPs in 5’-UTR and 3’-UTR regions, respectively. (D) Venn diagram of 353 peptide-supported SNPs from CHO-K1, CHO-S, and DG44 cell lines shows that most SNPs are shared across cell lines.
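The scaling described for panel (C) can be sketched as follows. This is a hypothetical illustration, not the authors' code: it assumes 0-based transcript coordinates with an exclusive CDS end, and the function names are invented for this example.

```python
def scaled_snp_position(snp_pos, cds_start, cds_end):
    """Scale a SNP position so the CDS spans [0, 1).

    Assumes 0-based transcript coordinates with cds_end exclusive
    (an illustrative convention, not stated in the caption).
    """
    cds_length = cds_end - cds_start
    if cds_length <= 0:
        raise ValueError("cds_end must be greater than cds_start")
    return (snp_pos - cds_start) / cds_length


def classify_region(scaled_pos):
    """Label a scaled position as 5'-UTR, CDS, or 3'-UTR."""
    if scaled_pos < 0:
        return "5'-UTR"
    if scaled_pos >= 1:
        return "3'-UTR"
    return "CDS"


# A transcript whose CDS occupies positions 100..399.
for pos in (40, 250, 430):
    scaled = scaled_snp_position(pos, 100, 400)
    print(pos, round(scaled, 2), classify_region(scaled))
```

Under this convention a value of exactly 0 is the first base of the start codon, negative values fall upstream of it, and values of 1 or more fall downstream of the CDS.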
Figure 5. A proteogenomic identification of the source of translated endogenous retroviral particles shed from CHO cells.
(A) Two strategies were used to identify translated retroviral loci. In strategy 1, peptides were mapped to the annotated retroviral proteins. In strategy 2, the sequences from the NCBI retroviral protein database were aligned to the genome using BLASTP. We then evaluated the overlap of these aligned peptides with the novel peptides identified from the novel databases in our proteogenomics pipeline. (B) The strategies recovered 119 type-C peptide-supported retroviral proteins in CHO cell lines (the “other” category represents non-typical retroviral proteins, such as the p12 protein). (C) Peptide-supported type-C virus proteins were analyzed to assess the portion of protein sequence covered by peptides against peptide number. (D) Coverage of an envelope protein on the reverse strand. uni: reads map uniquely to the locus; sec: reads are secondary reads and map to multiple loci.
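The quantity plotted in panel (C), the portion of a protein sequence covered by peptides, can be computed along the following lines. This is a minimal sketch assuming exact substring matching of peptides to the protein; the function name and the example sequences are made up for illustration and do not come from the study.

```python
def peptide_coverage(protein, peptides):
    """Return the fraction of residues covered by at least one peptide.

    Peptides are matched as exact substrings (an assumption); every
    occurrence of a peptide counts, and overlaps are counted once.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            # Continue searching one residue later to catch overlapping hits.
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


# Hypothetical 10-residue protein with three matching peptides.
coverage = peptide_coverage("MKTAYIAKQR", ["MKT", "KTA", "AKQR"])
print(coverage)
```

Pairing this fraction with `len(peptides)` for each protein gives the two axes described in the caption; a real analysis would likely also handle isoleucine/leucine ambiguity and modified residues, which this sketch ignores.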
