Sci Rep. 2024 Mar 25;14(1):7028.
doi: 10.1038/s41598-024-57439-7.

Extend the benchmarking indel set by manual review using the individual cell line sequencing data from the Sequencing Quality Control 2 (SEQC2) project


Binsheng Gong et al. Sci Rep.

Abstract

Accurate indel calling plays an important role in precision medicine. A benchmarking indel set is essential for thoroughly evaluating the indel calling performance of bioinformatics pipelines. A reference sample with a set of known-positive variants was developed in the FDA-led Sequencing Quality Control Phase 2 (SEQC2) project, but the known-positive set contained only a limited number of indels. This project sought to provide an enriched set of known indels that would be more translationally relevant by focusing on additional cancer-related regions. A thorough manual review process, completed by 42 reviewers, two advisors, and a judging panel of three researchers, enriched the known indel set by an additional 516 indels. The extended benchmarking indel set covers a large range of variant allele frequencies (VAFs), with 87% of the indels having a VAF below 20% in reference Sample A. Reference Sample A and the indel set can therefore be used for comprehensive benchmarking of indel calling across a wider range of VAF values, particularly at the low end. Indel length was also variable, but the majority of indels were under 10 base pairs (bps). Most of the indels were within coding regions, with the remainder in gene regulatory regions. Although high confidence can be derived from the robust study design and meticulous human review, this extended indel set has not undergone orthogonal validation. The extended benchmarking indel set, along with the indels in the previously published known-positive set, was the truth set used to benchmark indel calling pipelines in a community challenge hosted on the precisionFDA platform. This benchmarking indel set and the reference samples can be used for a comprehensive evaluation of indel calling pipelines. Additionally, the insights and solutions obtained during the manual review process can aid in improving the performance of these pipelines.

Keywords: Benchmarking; Bioinformatics; Indel; Precision medicine; Quality control.

Conflict of interest statement

N.F.S. and N.N. are employees of Agilent Technologies. Other authors declare no competing interest.

Figures

Figure 1
(A) Reference Sample A was created by mixing equal masses of DNA from 10 cancer cell lines, i.e., Myeloma (B-lymphocyte, BLY), Glioblastoma (brain, BRA), Adenocarcinoma (breast, BRE), Adenocarcinoma (cervix, CRV), Liposarcoma (soft tissue, LIP), Hepatoblastoma (liver, LIV), Lymphoma (macrophage, MAC), Melanoma (skin, SKN), Carcinoma (testes, TES), Carcinoma (T-lymphoblast, TLY). Sample B was a DNA sample derived from a normal male control cell line. These 11 DNA samples were sequenced using three WES panels, i.e., Roche MedExome panel (WES1), IDT xGen Exome panel (WES2), and Agilent SureSelect Exome panel (WES3). For each WES panel, two library replicates were made and sequenced. In total, 66 BAM files were obtained after alignment to the hg19 reference genome. (B) Overall block diagram of the manual review process in this study. The figure illustrates the main steps of the indel manual review process.
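The count of 66 BAM files follows from the design in this caption: 11 DNA samples, 3 WES panels, and 2 library replicates per panel. The short sketch below enumerates that grid. The sample and panel abbreviations come from the caption, but the label "B" for the normal control and the file-naming scheme are assumptions made only for illustration and are not the authors' actual file names.

```python
from itertools import product

# Abbreviations taken from the Figure 1 caption; "B" stands in for Sample B.
SAMPLES = ["BLY", "BRA", "BRE", "CRV", "LIP", "LIV", "MAC", "SKN", "TES", "TLY", "B"]
PANELS = ["WES1", "WES2", "WES3"]
REPLICATES = [1, 2]


def bam_names():
    """List one hypothetical BAM file name per sample/panel/replicate."""
    # The naming pattern is an assumption, not taken from the study.
    return [
        f"{sample}_{panel}_rep{rep}.bam"
        for sample, panel, rep in product(SAMPLES, PANELS, REPLICATES)
    ]


names = bam_names()
print(len(names))  # 11 samples x 3 panels x 2 replicates
```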
Figure 2
Indel VAF distribution. (A) Binned VAF distribution by percentage of indels. Each color represents a different VAF bin. (B) Number of indel variants found across the VAF spectrum. Bar height represents the number of indel variants found within each VAF bin. VAF bins are represented on the x-axis and are non-linear.
Figure 3
Length of insertions (A) and deletions (B). Bar height represents the number of indel variants found at each length. The x-axis represents the length of the genomic event, either insertion (A) or deletion (B).
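A length histogram like the one this caption describes can be tallied from VCF-style REF and ALT alleles, where the indel length is the difference between the two allele lengths. The sketch below is not the authors' method. It assumes simple indels, and the function name and the allele pairs are hypothetical, made up purely for illustration.

```python
from collections import Counter


def classify_indel(ref, alt):
    """Return (kind, length_in_bp) for a VCF-style REF/ALT allele pair."""
    diff = len(alt) - len(ref)
    if diff > 0:
        return ("insertion", diff)
    if diff < 0:
        return ("deletion", -diff)
    return ("not_indel", 0)  # equal lengths: a substitution, not an indel


# Hypothetical REF/ALT pairs, made up for illustration only.
example_alleles = [("A", "ATG"), ("ACGT", "A"), ("G", "GC"), ("TCA", "T")]

# Count how many variants fall at each (kind, length) combination.
length_counts = Counter(classify_indel(ref, alt) for ref, alt in example_alleles)
for (kind, length), count in sorted(length_counts.items()):
    print(kind, length, count)
```

Keeping insertions and deletions as separate keys mirrors the two panels of the figure.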
Figure 4
Genomic impact of indels, broken down into frameshift, UTR, in-frame, intron/intergenic region, and others.
