Consistent RNA sequencing contamination in GTEx and other data sets
- PMID: 32321923
- PMCID: PMC7176728
- DOI: 10.1038/s41467-020-15821-9
Abstract
A challenge of next-generation sequencing is read contamination. We use Genotype-Tissue Expression (GTEx) datasets and technical metadata, along with RNA-seq datasets from other studies, to understand factors that contribute to contamination. Here we report that, of 48 analyzed tissues in GTEx, 26 have variant co-expression clusters of four highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicate contamination. Sample contamination is strongly associated with a sample being sequenced on the same day as a tissue that natively expresses those genes. Discrepant SNPs across four contaminating genes validate the contamination. Low-level contamination affects ~40% of samples and leads to numerous eQTL assignments in inappropriate tissues among these 18 genes. This type of contamination occurs widely, impacting the analysis of both bulk and single-cell (scRNA-seq) datasets. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets, impacting analyses.
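The same-day association described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Sample` record, the 10 TPM cutoff, the toy values, and the odds-ratio summary are all assumptions made for illustration. The idea is to flag non-pancreas samples that show reads from the four pancreas-enriched genes, then tabulate those flags against whether the sample shared a sequencing date with a pancreas sample.

```python
from dataclasses import dataclass, field
from datetime import date

# The four pancreas-enriched genes named in the abstract.
PANCREAS_GENES = ("PRSS1", "PNLIP", "CLPS", "CELA3A")


@dataclass
class Sample:
    """Hypothetical record: one sequenced sample with its run date and TPM values."""
    sample_id: str
    tissue: str
    run_date: date
    tpm: dict = field(default_factory=dict)  # gene symbol -> TPM


def flag_contaminated(samples, genes=PANCREAS_GENES,
                      source_tissue="Pancreas", threshold=10.0):
    """Return ids of non-source samples where any marker gene exceeds `threshold`.

    The threshold is an arbitrary illustrative cutoff, not a value from the paper.
    """
    return {
        s.sample_id
        for s in samples
        if s.tissue != source_tissue
        and any(s.tpm.get(g, 0.0) > threshold for g in genes)
    }


def same_day_table(samples, flagged, source_tissue="Pancreas"):
    """Build a 2x2 table: rows = same day as source tissue (yes, no),
    columns = flagged as contaminated (yes, no)."""
    source_dates = {s.run_date for s in samples if s.tissue == source_tissue}
    table = [[0, 0], [0, 0]]
    for s in samples:
        if s.tissue == source_tissue:
            continue  # the source tissue expresses these genes natively
        row = 0 if s.run_date in source_dates else 1
        col = 0 if s.sample_id in flagged else 1
        table[row][col] += 1
    return table


def odds_ratio(table):
    """Odds ratio with a 0.5 continuity correction so zero cells do not divide by zero."""
    (a, b), (c, d) = table
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


# Toy data (invented for illustration only).
day1, day2 = date(2012, 3, 1), date(2012, 3, 2)
samples = [
    Sample("P1", "Pancreas", day1, {"PRSS1": 5000.0}),
    Sample("L1", "Lung", day1, {"PRSS1": 40.0}),
    Sample("L2", "Lung", day2, {"PRSS1": 0.5}),
    Sample("H1", "Heart", day1, {"PNLIP": 25.0}),
    Sample("H2", "Heart", day2, {}),
]
flagged = flag_contaminated(samples)
table = same_day_table(samples, flagged)
print(sorted(flagged), table, odds_ratio(table))
```

On real data, a fixed cutoff would likely be too crude; the abstract refers to variant co-expression clusters and SNP discrepancies as the evidence actually used, so this sketch only conveys the shape of the same-day comparison.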
Conflict of interest statement
The authors declare no competing interests.