BMC Bioinformatics. 2012 Dec 24;13:335.
doi: 10.1186/1471-2105-13-335.

Unlocking the potential of publicly available microarray data using inSilicoDb and inSilicoMerging R/Bioconductor packages

Jonatan Taminau et al. BMC Bioinformatics. 2012.

Abstract

Background: With an abundance of microarray gene expression data sets available through public repositories, new possibilities lie in combining multiple existing data sets. In this new context, analysis itself is no longer the problem; instead, retrieving and consistently integrating all these data before delivering them to the wide variety of existing analysis tools becomes the new bottleneck.

Results: We present the newly released inSilicoMerging R/Bioconductor package which, together with the earlier released inSilicoDb R/Bioconductor package, allows consistent retrieval, integration and analysis of publicly available microarray gene expression data sets. The inSilicoMerging package also provides a set of five visual and six quantitative validation measures.

Conclusions: By providing (i) access to uniformly curated and preprocessed data, (ii) a collection of techniques to remove the batch effects between data sets from different sources, and (iii) several validation tools enabling the inspection of the integration process, these packages enable researchers to fully explore the potential of combining gene expression data for downstream analysis. The power of using both packages is demonstrated by programmatically retrieving and integrating gene expression studies from the InSilico DB repository [https://insilicodb.org/app/].
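To make the idea of removing batch effects between data sets concrete, here is a minimal Python sketch of gene-wise batch mean-centering. It is an illustration only: it is not COMBAT and not the inSilicoMerging API, and the function name `batch_mean_center`, the toy data and the study labels are all assumptions introduced for this example.

```python
import numpy as np

def batch_mean_center(expr, batches):
    """Subtract each gene's per-batch mean so every batch is centered at zero.

    expr    : (genes x samples) array of log-scale expression values
    batches : sequence of batch (study) labels, one per sample
    """
    expr = np.asarray(expr, dtype=float)
    batches = np.asarray(batches)
    corrected = expr.copy()
    for b in np.unique(batches):
        cols = batches == b
        # Gene-wise mean within this batch, kept as a column for broadcasting.
        corrected[:, cols] -= expr[:, cols].mean(axis=1, keepdims=True)
    return corrected

# Hypothetical toy data: 3 genes, two studies, study "B" shifted upwards.
rng = np.random.default_rng(0)
study_a = rng.normal(5.0, 1.0, size=(3, 4))
study_b = rng.normal(7.0, 1.0, size=(3, 5))
merged = np.hstack([study_a, study_b])
labels = ["A"] * 4 + ["B"] * 5

centered = batch_mean_center(merged, labels)
```

After centering, every gene has mean zero within each study, so the constant offset between the studies is gone. This simple approach can also remove genuine biological differences when the studies differ in sample composition, which is one reason more elaborate methods exist.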

Figures

Figure 1
MDS plots. Visual inspection of two merged data sets using double-labeled multidimensional scaling (MDS) plots. In these MDS plots, samples are colored by the target biological variable of interest and assigned a symbol by the study they originate from. On the left the two data sets are merged without any transformation and on the right the two data sets are merged using the COMBAT method. The MDS plots show that samples cluster by study without any transformation and by disease after performing COMBAT.
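The coordinates behind an MDS plot can be sketched with classical (Torgerson) MDS on Euclidean distances between samples. This is a minimal illustration in Python; the distance measure and MDS variant actually used for the figure are not stated here, so both are assumptions, as are the function name `classical_mds` and the toy points.

```python
import numpy as np

def classical_mds(samples, n_components=2):
    """Embed samples (rows) in n_components dimensions via classical MDS."""
    x = np.asarray(samples, dtype=float)
    # Squared Euclidean distances between all pairs of samples.
    sq_norms = (x ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    n = x.shape[0]
    # Double centering turns squared distances into an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    # Keep the largest eigenvalues; clip tiny negatives from round-off.
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

# Hypothetical toy input: four samples measured on two features.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
coords = classical_mds(points)
```

Plotting `coords` with color for the biological variable and marker shape for the study would give a double-labeled plot of the kind described in the caption.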
Figure 2
Gene-wise density plots. Visual inspection of two merged data sets using gene-wise density plots. For the randomly selected MYL4 gene, density plots in each study are shown, colored by study. On the left the two data sets are merged without any transformation and on the right the two data sets are merged using the COMBAT method. The gene-wise density plots show that after transformation the distributions in the two studies are much more similar.
Figure 3
RLE plots. Visual inspection of two merged data sets using relative log expression (RLE) plots. In these RLE plots, samples are colored by study. For clarity, only 40 randomly selected samples are shown. On the left the two data sets are merged without any transformation and on the right the two data sets are merged using the COMBAT method. After applying COMBAT the mean of the RLE is approximately 0 for all genes, which indicates good batch effect removal.
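As a rough illustration of what an RLE plot displays, the sketch below computes relative log expression under the common definition: each gene's median across samples is subtracted from its log-scale values, and the resulting deviations are then summarized per sample. The function name `relative_log_expression` and the toy matrix are assumptions made for this example, not part of the packages.

```python
import numpy as np

def relative_log_expression(log_expr):
    """Return gene-wise deviations from the across-sample median.

    log_expr : (genes x samples) array of log-scale expression values
    """
    log_expr = np.asarray(log_expr, dtype=float)
    gene_medians = np.median(log_expr, axis=1, keepdims=True)
    return log_expr - gene_medians

# Hypothetical toy data: samples 1-2 from one study, samples 3-4 from another
# study whose values are shifted upwards by roughly 2 log units.
log_expr = np.array([
    [5.0, 5.2, 7.1, 7.0],
    [3.0, 3.1, 5.0, 5.2],
    [8.0, 7.9, 9.9, 10.1],
])
rle = relative_log_expression(log_expr)

# One box per sample in an RLE plot summarizes a column of `rle`;
# the per-sample median is a quick numeric proxy for the box position.
sample_medians = np.median(rle, axis=0)
```

In this toy case the per-sample medians are negative for the first study and positive for the second, which is the pattern an uncorrected batch effect produces. Values close to zero for every sample would suggest the offset has been removed.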
Figure 4
Gene-wise box plots. Visual inspection of two merged data sets using gene-wise box plots. Box plots of the randomly selected MYL4 gene are grouped by study and colored by the target biological variable of interest. On the left the two data sets are merged without any transformation and on the right the two data sets are merged using the COMBAT method. After batch effect removal the distribution of the gene is much more similar between studies than before.
Figure 5
Dendrogram plots. Visual inspection of two merged data sets using dendrogram plots. In these dendrogram plots, samples are labeled by a number corresponding to the study they originate from. For clarity, only 40 randomly selected samples are used to perform the hierarchical clustering. On the left the two data sets are merged without any transformation and on the right the two data sets are merged using the COMBAT method. In the right plot, samples originating from different studies are mixed, while in the left plot they are grouped per study.
