Uncovering hidden duplicated content in public transcriptomics data

Marta Rosikiewicz¹, Aurélie Comte, Anne Niknejad, Marc Robinson-Rechavi, Frederic B Bastian

PMID: 23487185
PMCID: PMC3595988
DOI: 10.1093/database/bat010

Marta Rosikiewicz et al. Database (Oxford). 2013 Mar 13:2013:bat010. doi: 10.1093/database/bat010. Print 2013.

Affiliation

¹ Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland.

Abstract

As part of the development of the database Bgee (a dataBase for Gene Expression Evolution), we annotate and analyse expression data of different types and from different sources, notably Affymetrix data from GEO and ArrayExpress, and RNA-Seq data from SRA. During our quality control procedure, we have identified duplicated content in GEO and ArrayExpress, affecting ∼14% of our data: fully or partially duplicated experiments from independent data submissions, Affymetrix chips reused in several experiments, or reused within an experiment. We present here the procedure that we have established to filter such duplicates from Affymetrix data, and our procedure to identify future potential duplicates in RNA-Seq data.
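
The abstract does not spell out how duplicates are detected, so the following is only an illustrative sketch of one plausible approach, not the authors' procedure: fingerprint each chip by hashing its probeset-level values in a canonical order, then report chips whose fingerprints collide within or across experiments. The function names, the experiment and chip identifiers, and the four-decimal rounding are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(values):
    """Hash probeset values in sorted order so identical chips get the same digest.

    `values` maps a probeset ID to its signal. Rounding to four decimals is an
    assumption of this sketch, meant to absorb trivial formatting differences.
    """
    canonical = "\n".join(
        f"{probe}\t{signal:.4f}" for probe, signal in sorted(values.items())
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_duplicates(chips):
    """Group chips by fingerprint and return the groups with more than one member.

    `chips` maps (experiment_id, chip_id) to a {probeset_id: signal} dict.
    """
    groups = defaultdict(list)
    for key, values in chips.items():
        groups[content_fingerprint(values)].append(key)
    return [sorted(keys) for keys in groups.values() if len(keys) > 1]


# Hypothetical data: one chip submitted under two different experiments.
chips = {
    ("EXP_A", "chip1"): {"p1": 10.5, "p2": 3.25},
    ("EXP_A", "chip2"): {"p1": 11.0, "p2": 3.25},
    ("EXP_B", "chip9"): {"p2": 3.25, "p1": 10.5},
}
duplicates = find_duplicates(chips)
```

Here `duplicates` groups `("EXP_A", "chip1")` with `("EXP_B", "chip9")`, the kind of chip reuse across experiments that the abstract describes. Exact-content hashing would miss chips that were renormalized before resubmission, so a real pipeline would likely need additional checks.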
