Nat Commun. 2020 Jul 29;11(1):3697.
doi: 10.1038/s41467-020-17453-5.

Detecting sample swaps in diverse NGS data types using linkage disequilibrium

Nauman Javed et al. Nat Commun.

Abstract

As the number of genomics datasets grows rapidly, sample mislabeling has become a high-stakes issue. We present CrosscheckFingerprints (Crosscheck), a tool for quantifying sample relatedness and detecting incorrectly paired sequencing datasets from different donors. Crosscheck outperforms similar methods and is effective even when data are sparse or come from different assays. Applying Crosscheck to 8851 ENCODE ChIP-, RNA-, and DNase-seq datasets enabled us to identify and correct dozens of mislabeled samples and ambiguous metadata annotations, representing ~1% of ENCODE datasets.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Incorporating linkage information allows robust comparison of sequencing datasets.
a Sample swaps and misannotations, in which a sample is attributed to the wrong donor, are a high-stakes issue for large consortium projects and clinical science.
b Our method compares reads from two datasets across a genome-wide set of linkage disequilibrium (LD) blocks (a haplotype map). The single-nucleotide polymorphisms (SNPs) in each block are highly correlated with each other and have low correlation with SNPs in other blocks. Reads overlapping any of the SNPs in a given block inform the relatedness of the datasets, even when reads from the two datasets do not overlap one another.
c Haplotype maps contain many large LD blocks. LD blocks are created using common, ancestry-independent SNPs from 1000 Genomes. Most SNPs lie within blocks of size >2, which increases the chance that a read is informative.
d Distribution of LOD (log-odds ratio) scores for 34,336 donor-mismatched (red) and 9767 donor-matched (green) pairs of public ChIP-, RNA-, and DNase-seq datasets from the ENCODE project.
e The LD-based method can correctly determine sample relatedness even at low sequencing coverage. Reference dataset pairs were compared pairwise at different subsampling percentages using two equally sized SNP panels: one panel contained only independent single SNPs, while the other contained only LD blocks. Donor-mismatched dataset pairs are colored red, while donor-matched dataset pairs are green.
f Comparison of NGSC and Crosscheck’s classification of 34,336 donor-mismatched and 9767 donor-matched dataset pairs. Performance was measured in terms of the false flag rate (FFR), the fraction of donor-matched pairs incorrectly flagged as donor mismatches, and the false-match rate (FMR), the fraction of donor-mismatched pairs incorrectly identified as donor matches. Comparisons are classified as same-assay if the two datasets are from the same assay type and, in the case of ChIP-seq datasets, have the same target epitope. All other comparisons are classified as cross-assay.
(Elements of (a and b) have been modified from a CDC publication (https://commons.wikimedia.org/wiki/File:Access_to_Health_Care-CDC_Vital_Signs-November_2010.pdf) which is under a CC BY-SA licence: https://creativecommons.org/licenses/by-sa/4.0/deed.en.)
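
The two error rates defined in panel f can be illustrated with a small sketch. This is not the authors' evaluation code: the function name `false_rates`, the toy scores, and the rule that a pair is called a donor match when its LOD score exceeds a threshold of 0 are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def false_rates(
    pairs: Iterable[Tuple[float, bool]], threshold: float = 0.0
) -> Tuple[float, float]:
    """Return (FFR, FMR) for (lod_score, truly_same_donor) pairs.

    Assumption: a pair is called a donor match when lod_score > threshold.
    """
    matched_total = matched_flagged = 0
    mismatched_total = mismatched_called_match = 0
    for lod, same_donor in pairs:
        called_match = lod > threshold
        if same_donor:
            matched_total += 1
            if not called_match:
                matched_flagged += 1  # donor-matched pair flagged as mismatch
        else:
            mismatched_total += 1
            if called_match:
                mismatched_called_match += 1  # mismatched pair called a match
    ffr = matched_flagged / matched_total if matched_total else float("nan")
    fmr = (
        mismatched_called_match / mismatched_total
        if mismatched_total
        else float("nan")
    )
    return ffr, fmr


# Hypothetical toy data: (LOD score, whether the pair truly shares a donor).
toy_pairs = [
    (12.0, True), (8.5, True), (-3.0, True), (40.0, True),
    (-25.0, False), (-9.0, False), (2.0, False), (-14.0, False),
]
ffr, fmr = false_rates(toy_pairs)
print(f"FFR={ffr:.2f} FMR={fmr:.2f}")
```

In the toy data, one of four donor-matched pairs scores below the threshold and one of four donor-mismatched pairs scores above it, so both rates come out to 0.25.
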
Fig. 2. Overview of ENCODE database swap detection.
a Overview of 8851 genotyped datasets from ENCODE, partitioned by cell type (top left), assay type (top right), and by target for ChIP-seq (bottom). Cell types with fewer than 100 datasets were pooled into two groups: those with fewer than 30 datasets and those with 30-100 datasets. All hg19-aligned reads from total RNA-, polyA RNA-, ChIP-, and DNase-seq experiments performed on samples belonging to donors with at least four datasets in total were included in the analysis. All ChIP-seq targets, including histone modifications (HM), transcription factors (TF), chromatin modifiers (CM), CTCF, and control experiments, were included.
b Distribution of LOD scores from ENCODE genotyping. Each dataset was compared to three representative datasets from its nominal donor. Any dataset scoring negatively against any of the three representatives was flagged for further review. A comparison resulting in an LOD score between −5 and 5 was deemed inconclusive (insufficient evidence to indicate shared or distinct genetic origin).
c Each flagged sample was compared to all other samples from its nominal donor, as well as to the representatives for all other donors in our database, to nominate its true donor identity and identify genetically consistent sub-clusters. Comparisons of flagged samples between two HUVEC donors reveal five genetically distinct clusters.
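
The review rule described in panel b can be sketched as follows. This is a minimal illustration, not the ENCODE pipeline: the function names `classify_lod` and `flag_for_review` are hypothetical, and treating scores of exactly −5 and 5 as inconclusive is an assumption, since the caption does not say how the boundaries are handled.

```python
from typing import Sequence

INCONCLUSIVE_BOUND = 5.0  # caption: LOD between -5 and 5 is inconclusive


def classify_lod(lod: float) -> str:
    """Label one comparison by its LOD score (boundaries assumed inclusive)."""
    if -INCONCLUSIVE_BOUND <= lod <= INCONCLUSIVE_BOUND:
        return "inconclusive"
    return "match" if lod > 0 else "mismatch"


def flag_for_review(representative_lods: Sequence[float]) -> bool:
    """Flag a dataset if it scores negatively against any representative."""
    return any(lod < 0 for lod in representative_lods)


# Hypothetical scores of one dataset against three representatives
# of its nominal donor.
scores = [10.0, 7.5, -2.0]
print([classify_lod(s) for s in scores], flag_for_review(scores))
```

Under these assumptions a dataset can be flagged by a score that is itself inconclusive (here −2.0), which fits the caption's framing of flagging as a trigger for further review and not a final call.
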
