Sci Rep. 2023 Nov 27;13(1):20817.
doi: 10.1038/s41598-023-48285-0.

GASOLINE: detecting germline and somatic structural variants from long-reads data

Alberto Magi et al. Sci Rep.

Abstract

Long-read sequencing allows the analysis of single nucleic-acid molecules and produces sequences on the order of tens to hundreds of kilobases. Applied to whole-genome analysis, it allows identification of complex genomic structural variants (SVs) with unprecedented resolution. SV identification, however, requires complex computational methods based on either read-depth or intra- and inter-alignment signature approaches, which are limited by the size or type of SVs. Moreover, most currently available tools detect only germline variants, thus requiring separate computation of sample pairs for comparative analyses. To overcome these limits, we developed a novel tool (Germline And SOmatic structuraL varIants detectioN and gEnotyping; GASOLINE) that groups SV signatures using a sophisticated clustering procedure based on a modified reciprocal overlap criterion and is designed to identify germline SVs from single samples and somatic SVs from paired test and control samples. GASOLINE is a collection of Perl, R and Fortran codes; it analyzes aligned data in BAM format and produces VCF files with statistically significant somatic SVs. Germline or somatic analysis of a 30× sequencing coverage experiment requires 4–5 h with 20 threads. GASOLINE outperformed currently available methods in the detection of both germline and somatic SVs in synthetic and real long-read datasets. Notably, when applied to a pair of metastatic melanoma and matched-normal samples, GASOLINE identified five genuine somatic SVs that were missed using five different sequencing technologies and state-of-the-art SV calling approaches. Thus, GASOLINE identifies germline and somatic SVs with unprecedented accuracy and resolution, outperforming currently available state-of-the-art WGS long-read computational methods.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
GASOLINE workflow. Panel (a) shows the steps to calculate the NRO for a pair of SV signature coordinates. Once the gap- and split-alignment coordinates (Si and Ei for i = 1, 2) have been extracted from each read (a1), they are used to calculate the size of the non-overlapping (NO121 and NO122, blue lines) and overlapping (O12) segments for each pair of signatures (a2). NO121, NO122, O12 and the total size of the two intervals (L1 and L2) are then used to calculate the NRO12 coefficient (a3). Panel (b) reports the workflow followed by GASOLINE for the detection of germline SVs in a sample. After signature extraction, the tool calculates the NROij between all signature pairs and generates an NRO matrix (b1), which is used as an adjacency matrix to create an undirected graph by filtering out NROij values smaller than a predefined threshold (continuous edges represent NROij > NROthr, while dotted edges represent NROij < NROthr) (b2). The undirected graph is then analyzed with the Eppstein–Löffler–Strash algorithm to extract maximal cliques, which represent clusters of SV signatures that can be assumed to be generated from the same SV event (b3). Next, all the SV signatures of a cluster are used to estimate the genomic coordinates (orange segment) of each SV event (b4). Finally, the number of SV signatures in a cluster and the total number of reads aligned at the breakpoints are used for genotyping with a maximum-likelihood Bayesian classification algorithm (b5). Panel (c) reports the steps that GASOLINE follows to detect somatic SVs. Somatic SVs are identified by comparing the SV signatures of a test (cancer) sample with those of a control (normal) sample. The SVs detected in the test sample (c1) are compared with the SV signatures extracted from the control sample (c2) by calculating the NRO: SV signatures with an NROSV larger than a predefined threshold are considered to be generated from the SV event of the test sample (c3).
The statistical significance of each somatic SV is calculated by applying Fisher's exact test to the contingency table in (c4): NRRT (number of reads without SV signatures in the test sample), NVRT (number of reads with the SV signatures in the test sample), NRRC (number of reads without SV signatures in the control sample) and NVRC (number of reads with the SV signatures in the control sample). SVs with a p-value smaller than a predefined significance threshold are considered somatic.
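The clustering steps (b1) to (b3) can be sketched in a few lines of Python. This is a minimal illustration and not GASOLINE's own code, which is written in Perl, R and Fortran. It assumes a precomputed symmetric NRO matrix (the NRO formula itself is not given in this caption), keeps edges whose NRO exceeds a threshold, and enumerates maximal cliques with a Bron–Kerbosch search that uses pivoting and a degeneracy ordering, the strategy described by Eppstein, Löffler and Strash. The function names, the toy matrix and the 0.7 threshold are hypothetical.

```python
def threshold_graph(nro, thr):
    """Adjacency sets from a symmetric NRO matrix, keeping edges with NRO > thr."""
    n = len(nro)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if nro[i][j] > thr:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def degeneracy_order(adj):
    """Order vertices by repeatedly removing one of minimum remaining degree."""
    remaining = {v: set(nb) for v, nb in adj.items()}
    order = []
    while remaining:
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        order.append(v)
        for u in remaining[v]:
            remaining[u].discard(v)
        del remaining[v]
    return order


def _bron_kerbosch_pivot(adj, r, p, x, out):
    """Report every maximal clique that extends r using candidates p, excluding x."""
    if not p and not x:
        out.append(sorted(r))
        return
    # Pivot on the vertex with the most neighbours among the candidates.
    pivot = max(p | x, key=lambda u: len(adj[u] & p))
    for v in list(p - adj[pivot]):
        _bron_kerbosch_pivot(adj, r | {v}, p & adj[v], x & adj[v], out)
        p = p - {v}
        x = x | {v}


def maximal_cliques(adj):
    """Enumerate maximal cliques, visiting vertices in degeneracy order."""
    out = []
    order = degeneracy_order(adj)
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        later = {u for u in adj[v] if pos[u] > pos[v]}
        earlier = {u for u in adj[v] if pos[u] < pos[v]}
        _bron_kerbosch_pivot(adj, {v}, later, earlier, out)
    return sorted(out)


# Toy NRO matrix for five SV signatures (made-up values, for illustration only).
nro = [
    [1.00, 0.90, 0.80, 0.10, 0.00],
    [0.90, 1.00, 0.85, 0.20, 0.00],
    [0.80, 0.85, 1.00, 0.30, 0.10],
    [0.10, 0.20, 0.30, 1.00, 0.90],
    [0.00, 0.00, 0.10, 0.90, 1.00],
]
clusters = maximal_cliques(threshold_graph(nro, thr=0.7))
print(clusters)
```

In this toy case signatures 0, 1 and 2 form one cluster and signatures 3 and 4 another; each cluster would then feed the coordinate estimation and genotyping steps (b4, b5), which are not sketched here.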
Figure 2
Global performance of GASOLINE and the other three tools in the detection of synthetic and real germline SVs. Panels (a–f) report the F1 score obtained by the four tools in the analysis of simulated inversions (a,d), duplications (b,e) and translocations (c,f). Results are reported for ONT (a–c) and PacBio (d–f) synthetic reads aligned with minimap2. Panels (g–j) report precision and recall obtained by the four tools in the analysis of the NA24385 datasets for the detection of small SVs (g,h) and large SVs (i,j) with ONT (g,i) and PacBio (h,j) data. The curves in panels (g–j) were obtained by ordering all the SVs by number of supporting reads and calculating precision and recall while including SVs with a decreasing number of reads. Panels (k–n) show the F1 score obtained by the four tools for the NA24385 datasets in the detection of small (k,l) and large SVs (m,n) with ONT (k,m) and PacBio (l,n) datasets at different sequencing coverages.
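The construction of the precision and recall curves for the NA24385 datasets (rank calls by supporting reads, then add them one at a time) can be illustrated with a short Python sketch. It assumes each call has already been labelled as a true or false positive against a truth set and that the size of the truth set is known. The function names and the example calls are hypothetical, and ties in read support are broken arbitrarily, which is a simplification.

```python
def precision_recall_curve(calls, n_truth):
    """calls: list of (supporting_reads, is_true_positive) pairs.

    Returns one (precision, recall) point per call, adding calls in order of
    decreasing read support.
    """
    ranked = sorted(calls, key=lambda c: c[0], reverse=True)
    points = []
    true_positives = 0
    for k, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += int(is_tp)
        points.append((true_positives / k, true_positives / n_truth))
    return points


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up calls: (number of supporting reads, matches the truth set?)
example_calls = [(30, True), (10, False), (25, True), (5, True)]
curve = precision_recall_curve(example_calls, n_truth=4)
for precision, recall in curve:
    print(f"precision={precision:.2f} recall={recall:.2f} "
          f"F1={f1_score(precision, recall):.2f}")
```

The last point of the curve corresponds to keeping every call, which is the setting an overall F1 score summarises.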
Figure 3
Performance of GASOLINE on the detection of somatic SVs. Panels (a–d) show the F1 score obtained by the four tools in the detection of simulated small (a,b) and large somatic SVs (c,d) with ONT (a,c) and PacBio (b,d) datasets at different sequencing coverages. Panels (e–o) report the precision-recall obtained by GASOLINE and the other three tools in the detection of deletions (e,j), insertions (f,l), duplications (g,m), inversions (h,n) and translocations (i,o) of the Valle-Inclan et al. true-set for the COLO829 cell lines sequenced with ONT (e–i) and PacBio (j–o) technologies. The results for GASOLINE are reported for different somatic p-value thresholds (5×10⁻¹, 1×10⁻¹, 5×10⁻², 1×10⁻², 5×10⁻³, 1×10⁻³, 5×10⁻⁴, 1×10⁻⁴). All the results reported in the panels are based on minimap2 alignment data.
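The somatic p-value thresholds listed above apply to Fisher's exact test on the read-count contingency table described for Figure 1 (NRRT, NVRT, NRRC, NVRC). The Python sketch below shows how such a p-value could be computed and compared with those thresholds. It is an illustration only: the source does not state whether the test is one- or two-sided, so the one-sided form (more variant reads in the test sample than expected) is an assumption, and the function name and read counts are hypothetical.

```python
from math import comb

# Somatic p-value thresholds listed in the Figure 3 caption.
THRESHOLDS = [5e-1, 1e-1, 5e-2, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4]


def fisher_one_sided(nrr_t, nvr_t, nrr_c, nvr_c):
    """One-sided Fisher's exact test on a 2x2 read-count table.

    nrr_t / nvr_t: reads without / with the SV signature in the test sample.
    nrr_c / nvr_c: reads without / with the SV signature in the control sample.
    Returns P(at least nvr_t variant reads fall in the test sample) under the
    hypergeometric null with all margins fixed. One-sidedness is an assumption.
    """
    n_test = nrr_t + nvr_t
    n_ctrl = nrr_c + nvr_c
    n_var = nvr_t + nvr_c
    denom = comb(n_test + n_ctrl, n_var)
    upper = min(n_test, n_var)
    tail = sum(comb(n_test, k) * comb(n_ctrl, n_var - k)
               for k in range(nvr_t, upper + 1))
    return tail / denom


# Made-up counts: 15 of 30 test reads carry the SV, none of 30 control reads do.
p_somatic = fisher_one_sided(nrr_t=15, nvr_t=15, nrr_c=30, nvr_c=0)
# Same variant fraction in both samples: no evidence that the SV is somatic.
p_shared = fisher_one_sided(nrr_t=15, nvr_t=15, nrr_c=15, nvr_c=15)

passed = [t for t in THRESHOLDS if p_somatic < t]
print(f"p_somatic={p_somatic:.3g}, passes {len(passed)} of {len(THRESHOLDS)} thresholds")
print(f"p_shared={p_shared:.3g}")
```

Lowering the threshold keeps fewer calls, which trades recall for precision along the GASOLINE curves.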
