BMC Bioinformatics. 2017 Nov 21;18(1):506.
doi: 10.1186/s12859-017-1925-0.

Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling

Daniel Castillo et al. BMC Bioinformatics. 2017.

Abstract

Background: Many public repositories now hold large microarray gene expression datasets. However, microarray technology is less powerful and less accurate than more recent Next Generation Sequencing technologies such as RNA-Seq. Even so, information from microarrays is reliable and robust, so it can be exploited by integrating microarray data with RNA-Seq data. Additionally, acquiring a large number of RNA-Seq samples and extracting information from them still entails very high costs in time and computational resources. This paper proposes a new model to find the gene signature of breast cancer cell lines through the integration of heterogeneous data from different breast cancer datasets, obtained with microarray and RNA-Seq technologies. Data integration is expected to lend greater statistical robustness to the results. Finally, a classification method is proposed to test the robustness of the Differentially Expressed Genes when unseen data is presented for diagnosis.

Results: The proposed data integration allows the analysis of gene expression samples coming from different technologies. The most significant genes of the whole integrated data were obtained through the intersection of three gene sets: the expressed genes identified within the microarray data alone, within the RNA-Seq data alone, and within the integrated data from both technologies. This intersection reveals 98 possible technology-independent biomarkers. Two heterogeneous datasets were used for the classification tasks: a training dataset for gene expression identification and classifier validation, and a test dataset with unseen data for testing the classifier. High classification accuracy was achieved on both, supporting the validity of the obtained set of genes as possible biomarkers for breast cancer. Through a feature selection process, a final small subset of six genes was selected for breast cancer diagnosis.
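The intersection step described in the Results can be sketched as a plain set operation. This is only a minimal illustration, assuming each analysis yields a collection of gene symbols; the function name and the gene identifiers are placeholders and do not come from the study.

```python
def technology_independent_genes(microarray_degs, rnaseq_degs, integrated_degs):
    """Return the genes reported by all three analyses, sorted for stable output."""
    common = set(microarray_degs) & set(rnaseq_degs) & set(integrated_degs)
    return sorted(common)


# Placeholder identifiers (assumption), not genes reported by the study.
microarray = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
rnaseq = ["GENE_B", "GENE_C", "GENE_D", "GENE_E"]
integrated = ["GENE_C", "GENE_D", "GENE_F"]

shared = technology_independent_genes(microarray, rnaseq, integrated)
print(shared)
```

Only genes present in every input survive, which is why a gene detected by a single technology cannot appear in the final candidate list.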

Conclusions: This work proposes a novel data integration stage in the traditional gene expression analysis pipeline through the combination of heterogeneous data from microarray and RNA-Seq technologies. Available samples have been successfully classified using a subset of six genes obtained by a feature selection method. Consequently, a new classification and diagnosis tool was built, and its performance was validated using previously unseen samples.
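The idea of validating a classifier on previously unseen samples can be illustrated with a small k-NN sketch, one of the three classifier families listed in the keywords. Everything here is an assumption made for illustration: the Euclidean distance, k = 3, the helper names, and the synthetic six-value expression vectors, none of which are taken from the study's data or code.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, sample, k=3):
    """Label a sample by majority vote among its k nearest training samples."""
    distances = sorted(
        (math.dist(row, sample), label) for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fraction of held-out samples whose predicted label matches the true one."""
    hits = sum(
        knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y)
    )
    return hits / len(test_y)


# Synthetic expression values for six hypothetical genes (not study data).
train_x = [
    [8.1, 7.9, 2.0, 6.5, 7.2, 1.8],
    [7.8, 8.3, 2.4, 6.9, 7.0, 2.1],
    [8.4, 7.6, 1.9, 6.2, 7.5, 1.5],
    [3.1, 2.8, 6.9, 2.2, 3.0, 7.1],
    [2.9, 3.3, 7.2, 2.6, 2.7, 6.8],
    [3.4, 3.0, 6.5, 2.0, 3.2, 7.4],
]
train_y = ["cancer"] * 3 + ["healthy"] * 3

# Held-out samples that the classifier never saw during training.
test_x = [
    [8.0, 8.0, 2.2, 6.6, 7.1, 1.9],
    [3.0, 3.1, 7.0, 2.3, 2.9, 7.0],
]
test_y = ["cancer", "healthy"]

print(accuracy(train_x, train_y, test_x, test_y))
```

Keeping the test samples out of `train_x` is the point of the exercise: accuracy measured on data used for fitting would overstate how well a gene subset generalizes.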

Keywords: Breast cancer; Cancer; Classification; Gene expression; Integration; Microarray; RNA-Seq; Random Forest; SVM; k-NN.

Conflict of interest statement

Consent for publication

Not applicable.

Competing interests

Not applicable.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1. Microarray gene expression pipeline
Fig. 2. RNA-Seq gene expression integration pipeline
Fig. 3. Integrated pipeline followed for this study
Fig. 4. Expression profile of training and test datasets before normalization
Fig. 5. Expression profile of training and test datasets after normalization
Fig. 6. Intersection of expressed genes in RNA-Seq, microarray and the integrated dataset
Fig. 7. Gene expression values boxplot for the set of 98 expressed genes. Figure shows significant differences between expression values for MCF7 and HS578T cancer cell lines and MCF10A non-cancer cell line
Fig. 8. Hierarchical cluster using the 98 invariant expressed genes
Fig. 9. Validation and test classification results with SVM, RF and k-NN using the most relevant genes obtained by mRMR
Fig. 10. Hierarchical cluster over healthy and breast cancer samples using the top 6 genes
Fig. 11. Average expression value boxplots of the six most relevant genes obtained in this study
