BMC Genomics. 2009 Apr 20;10:163.
doi: 10.1186/1471-2164-10-163.

A transcriptional sketch of a primary human breast cancer by 454 deep sequencing

Alessandro Guffanti et al. BMC Genomics. 2009.

Abstract

Background: The cancer transcriptome is difficult to explore due to the heterogeneity of quantitative and qualitative changes in gene expression linked to the disease status. An increasing number of "unconventional" transcripts, such as novel isoforms, non-coding RNAs, somatic gene fusions and deletions have been associated with the tumoral state. Massively parallel sequencing techniques provide a framework for exploring the transcriptional complexity inherent to cancer with a limited laboratory and financial effort. We developed a deep sequencing and bioinformatics analysis protocol to investigate the molecular composition of a breast cancer poly(A)+ transcriptome. This method utilizes a cDNA library normalization step to diminish the representation of highly expressed transcripts and biology-oriented bioinformatic analyses to facilitate detection of rare and novel transcripts.

Results: We analyzed over 132,000 Roche 454 high-confidence deep sequencing reads from a primary human lobular breast cancer tissue specimen, and detected a range of unusual transcriptional events that were subsequently validated by RT-PCR in eight additional primary human breast cancer samples. We identified and validated one deletion, two novel ncRNAs (one intergenic and one intragenic), ten previously unknown or rare transcript isoforms and a novel gene fusion specific to a single primary tissue sample. We also explored the non-protein-coding portion of the breast cancer transcriptome, identifying thousands of novel non-coding transcripts and more than three hundred reads corresponding to the non-coding RNA MALAT1, which is highly expressed in many human carcinomas.

Conclusion: Our results demonstrate that combining 454 deep sequencing with a normalization step and careful bioinformatic analysis facilitates the discovery and quantification of rare transcripts or ncRNAs, and can be used as a qualitative tool to characterize transcriptome complexity, revealing many hitherto unknown transcripts, splice isoforms, gene fusion events and ncRNAs, even at a relatively low sequence sampling.

Figures

Figure 1
Assessment of the cDNA library normalization before sequencing. RT-PCR amplification of three reference genes, each expressed constitutively but with different abundance in the cell, to test the normalization of the cDNA library before sequencing. X = control before normalization. N = normalized library.
Figure 2
Distribution and statistics of the cDNA reads. Distribution and statistics of the non-redundant cDNA sequences for the initial nebulized 454 reads dataset (251,262 sequences). The independent variable (X axis) is the read length; the dependent variable is the sequence count in each length bin. The red line is an approximation to a normal distribution, with a mean of 85.6 and an estimated dispersion of 22.1.
Figure 3
Uniformity of sequence coverage across transcripts. Sampling of an 'ideal' complete target transcript by the 98.98.1 read dataset. A value of 0 means that the 454 sequence identifies a point at the 5' end of the target RefSeq transcript, while a value of 100 corresponds to sampling of the 3' end. The upper box is an outlier box plot, representing the interquartile range, the mean and the limits of the outliers. The red line represents the shortest interval containing 50% of the data. The total number of matches represented in this plot is 74,208.
Figure 4
Distribution of Conserved Sequence Tags (CST) in relation to cDNA genome mapping. Plot of the percentage of CST conserved segments overlapping cDNA reads for each of the following categories: matching only once in the genome; matching from 2 to 10 times; matching more than 10 times. CST_UC = CST with unknown coding potential; CST_COD = CST with coding potential; CST_NCOD = CST without coding potential; CST_UND = undetermined coding potential.
Figure 5
Identification of a UBR4/GLB1 fusion transcript. (A, B) Details at the nucleotide level of the fusion event. Splice junctions are indicated in red. (C) Alignment of the divergent parts of the original GLB1 peptide and of the UBR4-GLB1 fusion. Residues before 4,433 are identical. Yellow and green backgrounds identify identical or conserved residues.
Figure 6
Identification of a deletion within the WHSC1L1 gene. (A, B) Alignment of the read 1B with the corresponding genome regions of WHSC1L1. Splice junction nucleotides are indicated in red. (C) Schematic representation of the putative intragenic deletion (probably due to a looped transcript) in WHSC1L1, identified by the read 1B. The green and blue arrows represent the two halves of the fusion transcript, which map in the opposite order relative to the genome.
Figure 7
Identification of a novel isoform of the BC031316 gene. (A) Alignment of the sequence read AI4 with the human genome. (B) UCSC human genome screenshot showing the novel isoform, with a longer first exon, of the annotated BC031316 gene, identified by the sequence read AI4 and further supported by many annotated spliced ESTs.
Figure 8
Identification of a putative cancer-associated isoform. UCSC genome screenshot showing the region where the 176265_1466_0318 read maps, clearly identifying the shorter protein isoform (1b) of HIGD1A, which is predicted to be significantly associated with the 'cancer' phenotype according to ASAPII EST and transcript analysis. The black block identified as 27858 is the ASAPII intron associated with the cancer-specific HIGD1A isoform 1b splice site.
Figure 9
Abundant representation of MALAT1 ncRNA in 454 sequences from the breast tumor sample. UCSC human genome screenshot of the region containing the MALAT1 locus. 309 cDNA reads map with high confidence along MALAT1, a ncRNA highly correlated with poor prognosis in several tumor types. These were assembled into 14 contigs, which are reported in the figure.
Figure 10
Identification of a novel conserved noncoding transcript. UCSC human genome 8 kb screenshot of a "gene desert region" (no known gene within a 50 kb boundary) on the X chromosome tagged as transcribed by the sequencing read 002770_3171_2414. The read overlaps a CRITICA-predicted putative noncoding transcript (CR621898) and points to a new, highly conserved transcriptional island, according to a 28-species vertebrate multi-species alignment and PhastCons conservation score.
Figure 11
Mapping the transcriptome to GO Molecular Function. GO Molecular Function mappings for the genes identified by the cDNA reads that correlate with the RefSeq transcriptome.
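
The quantities plotted in Figures 3 and 4 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the 0-based half-open coordinate convention and the use of the match midpoint as the sampled "point" are all assumptions made here for illustration. The first function places a read match on the 0 (5' end) to 100 (3' end) scale of Figure 3; the second assigns a read to the genome-match categories of Figure 4.

```python
from collections import Counter


def relative_position(match_start: int, match_end: int, transcript_length: int) -> float:
    """Return where a read match falls along a transcript, on a 0-100 scale.

    0 corresponds to the 5' end of the target transcript and 100 to the 3' end.
    Assumptions (not from the paper): coordinates are 0-based, half-open and on
    the transcript strand, and the match midpoint is taken as the sampled point.
    """
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    if not 0 <= match_start < match_end <= transcript_length:
        raise ValueError("match coordinates must lie within the transcript")
    midpoint = (match_start + match_end) / 2
    return 100.0 * midpoint / transcript_length


def match_category(genome_hits: int) -> str:
    """Bin a read by its number of genome matches: once, 2 to 10, or more than 10."""
    if genome_hits < 1:
        raise ValueError("a mapped read must have at least one genome match")
    if genome_hits == 1:
        return "once"
    if genome_hits <= 10:
        return "2-10"
    return ">10"


# Hypothetical hit counts, used only to show how the categories are tallied.
example_counts = Counter(match_category(n) for n in [1, 3, 10, 11, 250])
```

Under these assumptions, a 100-nt match at the start of a 1,000-nt transcript scores 5.0 and one at the very end scores 95.0, so a uniform sampling of transcripts would spread matches evenly between those extremes.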

