Nat Biotechnol. 2025 Jul;43(7):1177-1191.
doi: 10.1038/s41587-024-02382-1. Epub 2024 Oct 25.

Comprehensive genome analysis and variant detection at scale using DRAGEN

Sairam Behera et al. Nat Biotechnol. 2025 Jul.

Abstract

Research and medical genomics require comprehensive, scalable methods for the discovery of novel disease targets, evolutionary drivers and genetic markers with clinical significance. This necessitates a framework to identify all types of variants independent of their size or location. Here we present DRAGEN, which uses multigenome mapping with pangenome references, hardware acceleration and machine learning-based variant detection to provide insights into individual genomes, with ~30 min of computation time from raw reads to variant detection. DRAGEN outperforms current state-of-the-art methods in speed and accuracy across all variant types (single-nucleotide variations, insertions or deletions, short tandem repeats, structural variations and copy number variations) and incorporates specialized methods for analysis of medically relevant genes. We demonstrate the performance of DRAGEN across 3,202 whole-genome sequencing datasets by generating fully genotyped multisample variant call format files and demonstrate its scalability, accuracy and innovation to further advance the integration of comprehensive genomics. Overall, DRAGEN marks a major milestone in sequencing data analysis and will provide insights across various diseases, including Mendelian and rare diseases, with a highly comprehensive and scalable platform.

Conflict of interest statement

Competing interests: F.J.S. receives research support from Genentech, Illumina, PacBio and ONT. S.C., M.R., S.T., Z.H., M.R., A.V., G.P., C.R., A.F., V.O., S.M., J.H. and R.M. are employees of Illumina. The remaining authors declare no competing interests.

Figures

Fig. 1. Overview of the DRAGEN variant calling pipeline.
a–g, DRAGEN improves variant identification from a single base pair to multiple megabase pairs of alleles. This is achieved by implementing multiple optimized concepts. a, Mapping uses a pangenome reference including 64 haplotypes. b, SV calling is substantially improved over local assemblies based on breakpoint graphs; Chr, chromosome; DEL, deletion; DUP, duplication; INS, insertion; INV, inversion; BND, breakend (or breakpoint). c, SNV calling is improved using multiple strategies, including machine learning-based scoring and filtering. d, CNV calling uses the multigenome mapping and the SV calling information to make informed decisions; CN, copy number. e, An additional nine tools targeting specific difficult regions of the genome are included, four of which have not been previously reported; Hap, haplotype; Prop., proportion. f, STR calling is integrated based on ExpansionHunter. g, A gVCF genotyper implementation to provide a population-level fully genotyped VCF file; msVCF, multisample VCF.
Fig. 2. Performance overview of DRAGEN based on GIAB benchmarks.
a, Length distribution of small and large variants discovered by DRAGEN (bin sizes used for the plot (from left to right) are 500, 250, 150, 50, 150, 250 and 500). b, SNV comparison based on GIAB SNV v.4.2.1. c, SNV call comparisons based on CMRG v.1.0. d, Comparison of SV call performance (insertion and deletion types) based on GIAB SV v.0.6. e, Comparison of CMRG SV call performance (insertion and deletion types) based on GIAB CMRG SV v.1.0. f, CNV caller comparison of DRAGEN compared to CNVnator across different sizes of deletions based on GIAB SV v.0.6. g, Benchmarking of STRs using GIAB v.1.0 and the DRAGEN-specific STR caller. The benchmarking results of the DRAGEN small variant caller are represented in light blue (middle). The recall and F-measure scores were calculated using the GIAB catalog, and the recall* and F-measure* were calculated using the individual catalogs of DRAGEN and GangSTR. Results from Truvari comparisons against tandem repeat benchmarks displayed in the figure are restricted to indels of ≥5 bp (default).
Fig. 3. Performance overview of DRAGEN for samples HG001–HG007.
a, Length distribution of different variants for all samples (bin sizes used for the plot from left to right are 500, 250, 150, 50, 150, 250 and 500). b, Recall, precision and F-measures of DRAGEN for samples HG001–HG007. c, Comparison of false-negative (FN) and false-positive (FP) numbers among GATK and DeepVariant (DV) with BWA, DeepVariant with Giraffe and DRAGEN (DRAGEN 4.2) for HG001–HG007 SNV calls. d, Comparison of recall, precision and F-measures of samples HG001–HG007 for four different tools, that is, DRAGEN, GATK and DeepVariant with BWA and Giraffe with DeepVariant. The box plots display the minimum, maximum, median and spread of the middle 50% of the data (the interquartile range (IQR)), with whiskers indicating the range of the data within 1.5× the IQR and points beyond the whiskers representing outliers. e, Average F-measures and errors (false positives and false negatives) for different tools.
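The recall, precision and F-measure values reported in these benchmarks can be illustrated with a minimal sketch. It assumes the standard definitions computed from true-positive, false-positive and false-negative counts; the function name `benchmark_metrics` and the example counts are hypothetical, and the benchmarking tools used in the study may count matches differently (for example, per truth call versus per query call).

```python
def benchmark_metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Precision, recall and F-measure (F1) from benchmark comparison counts.

    tp: calls matching the truth set
    fp: calls absent from the truth set
    fn: truth-set variants that were missed
    """
    # Guard against empty denominators (e.g. no calls made at all).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    print(benchmark_metrics(tp=90, fp=10, fn=30))
```

Because the F-measure is a harmonic mean, a caller cannot score well by trading many false positives for fewer false negatives or vice versa.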
Fig. 4. DRAGEN SNV calls for the 1kGP sample.
a, PCA plot of principal component 1 (PC1) and PC2 for SNVs across the 1kGP population. b, Distribution of SNV counts. ASW, African Ancestry in South-West USA; ACB, African Caribbean in Barbados; BEB, Bengali in Bangladesh; GBR, British from England and Scotland; CDX, Chinese Dai in Xishuangbanna, China; CLM, Colombian in Medellín, Colombia; ESN, Esan in Nigeria; FIN, Finnish in Finland; GWD, Gambian in Western Division – Mandinka; GIH, Gujarati Indians in Houston, Texas, USA; CHB, Han Chinese in Beijing, China; CHS, Han Chinese South; IBS, Iberian populations in Spain; ITU, Indian Telugu in the UK; JPT, Japanese in Tokyo, Japan; KHV, Kinh in Ho Chi Minh City, Vietnam; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; MXL, Mexican Ancestry in Los Angeles, CA, USA; PEL, Peruvian in Lima, Peru; PUR, Puerto Rican in Puerto Rico; PJL, Punjabi in Lahore, Pakistan; STU, Sri Lankan Tamil in the UK; TSI, Toscani in Italy; YRI, Yoruba in Ibadan, Nigeria. c, Distribution of indel counts at the superpopulation level of 3,202 1kGP samples. The box plots display the minimum, maximum, median and spread of the middle 50% of the data (the interquartile range (IQR)), with whiskers indicating the range of the data within 1.5× the IQR and points beyond the whiskers representing outliers. d,e, Singleton (allele count (AC) = 1), rare (allele frequency (AF) ≤ 1%) and common variant (allele frequency > 1%) counts of GATK v.4.1 and DRAGEN v.4.2 callsets of SNVs (d) and indels (e) across the cohort level. The known and novel variants are based on the dbSNP 155 database. f, Distribution of SNVs based on their functional annotations (top) and the fraction of known and novel variants (bottom); miRNA, microRNA; UTR, untranslated region. g, Distribution of small indels based on their functional annotations. NMD, nonsense-mediated decay.
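The singleton, rare and common bins in panels d and e can be sketched as a simple classifier. This is an illustration under stated assumptions, not the authors' code: the function name `frequency_class` is hypothetical, singletons are tested first so that the three bins do not overlap, and the example allele number of 6,404 assumes a fully genotyped diploid autosomal site across all 3,202 samples.

```python
def frequency_class(allele_count: int, allele_number: int) -> str:
    """Bin a variant as singleton, rare or common.

    allele_count (AC): number of observed alternate alleles
    allele_number (AN): number of called alleles at the site
    """
    if allele_count <= 0 or allele_number <= 0 or allele_count > allele_number:
        raise ValueError("expected 0 < AC <= AN")
    # Assumption: AC = 1 takes priority, although such a variant also
    # satisfies AF <= 1% in a cohort of this size.
    if allele_count == 1:
        return "singleton"
    allele_frequency = allele_count / allele_number
    return "rare" if allele_frequency <= 0.01 else "common"


if __name__ == "__main__":
    an = 2 * 3202  # assumed: every sample genotyped at a diploid site
    for ac in (1, 64, 65):
        print(ac, frequency_class(ac, an))
```

With 6,404 alleles, the 1% threshold falls between allele counts of 64 and 65, which is why those values are used in the example.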
Fig. 5. DRAGEN SV calls for the 1kGP sample.
a, PCA of merged STRs, SVs and CNVs of 3,202 1kGP samples for deletions with a minor allele frequency of ≥5%. b,c, Distributions of insertion- (b) and deletion-type (c) SVs (≥50 bp) across 3,202 1kGP samples. d, Distribution of SVs, STRs and CNVs based on average count, that is, total variants of a population/population count; TRA, translocations. e, Distribution of variant numbers among all 3,202 samples for the 12 CMRG regions (in GRCh38) that are impacted due to false duplication and falsely collapsed errors. DRAGEN uses the corrected reference as a part of its multigenome approach to correctly identify more variants in duplicated and collapsed regions. f, Class I HLA allele frequency distributions among 3,202 1kGP populations. The box plots in b, c, e and f display the minimum, maximum, median and spread of the middle 50% of the data (the IQR), with whiskers indicating the range of the data within 1.5× the IQR and points beyond the whiskers representing outliers.

Cited by

  • Genetic Variant Analyses Identify Novel Candidate Autism Risk Genes from a Highly Consanguineous Cohort of 104 Families from Oman.
    Gupta V, Ben-Mahmoud A, Idris AB, Hottenga JJ, Habbab W, Alsayegh A, Kim HG, Al-Mamari W, Stanton LW. Int J Mol Sci. 2024 Dec 21;25(24):13700. doi: 10.3390/ijms252413700. PMID: 39769462 Free PMC article.
  • Unravelling mutational signatures with plasma circulating tumour DNA.
    Hollizeck S, Wang N, Wong SQ, Litchfield C, Guinto J, Ftouni S, Rebello R, Kanwal S, Dong R, Grimmond S, Sandhu S, Mileshkin L, Tothill RW, Chandrananda D, Dawson SJ. Nat Commun. 2024 Nov 14;15(1):9876. doi: 10.1038/s41467-024-54193-2. PMID: 39543119 Free PMC article.
  • Isoform analysis of heterozygous putative splicing variants at the allele level using nanopore long-read sequencing.
    Ozaki K, Irioka T, Noma S, Machida A, Fukunaga M, Murano T, Takahashi C, Tagami M, Kawashima T, Hirata T, Yasuoka Y, Kuwahara H, Araki T, Yagi K, Mizusawa H, Ishikawa K, Okazaki Y, Yokota T. Sci Rep. 2025 Aug 8;15(1):29001. doi: 10.1038/s41598-025-14566-z. PMID: 40781467 Free PMC article.
  • Development and extensive sequencing of a broadly-consented Genome in a Bottle matched tumor-normal pair.
    McDaniel JH, Patel V, Olson ND, He HJ, He Z, Cole KD, Gooden AA, Schmitt A, Sikkink K, Sedlazeck FJ, Doddapaneni H, Jhangiani SN, Muzny DM, Gingras MC, Mehta H, Behera S, Paulin LF, Hastie AR, Yu HC, Weigman V, Rojas A, Kennedy K, Remington J, Salas-González I, Sudkamp M, Wiseman K, Lajoie BR, Levy S, Jain M, Akeson S, Narzisi G, Steinsnyder Z, Reeves C, Shelton J, Kingan SB, Lambert C, Bayabyan P, Wenger AM, McLaughlin IJ, Adamson A, Kingsley C, Wescott M, Kim Y, Paten B, Park J, Violich I, Miga KH, Gardner J, McNulty B, Rosen GL, McCoy R, Brundu F, Sayyari E, Scheffler K, Truong S, Catreux S, Hannah LC, Lipson D, Benjamin H, Iremadze N, Soifer I, Krieger G, Eacker S, Wood M, Cross E, Husar G, Gross S, Vernich M, Kolmogorov M, Ahmad T, Keskus A, Bryant A, Thibaud-Nissen F, Trow J, Proszynski J, Hirschberg JW, Ryon K, Mason CE, Bhakta MS, Zachary Sanborn J, Munding EM, Wagner J, Xiao C, Liss AS, Zook JM. bioRxiv [Preprint]. 2025 Jun 14:2024.09.18.613544. doi: 10.1101/2024.09.18.613544. Update in: Sci Data. 2025 Jul 16;12(1):1195. doi: 10.1038/s41597-025-05438-2. PMID: 39345378 Free PMC article. Updated. Preprint.
  • Development and extensive sequencing of a broadly-consented Genome in a Bottle matched tumor-normal pair.
    McDaniel JH, Patel V, Olson ND, He HJ, He Z, Cole KD, Gooden AA, Schmitt A, Sikkink K, Sedlazeck FJ, Doddapaneni H, Jhangiani SN, Muzny DM, Gingras MC, Mehta H, Behera S, Paulin LF, Hastie AR, Yu HC, Weigman V, Rojas A, Kennedy K, Remington J, Salas-González I, Sudkamp M, Wiseman K, Lajoie BR, Levy S, Jain M, Akeson S, Narzisi G, Steinsnyder Z, Reeves C, Shelton J, Kingan SB, Lambert C, Baybayan P, Wenger AM, McLaughlin IJ, Adamson A, Kingsley C, Wescott M, Kim Y, Paten B, Park J, Violich I, Miga KH, Gardner J, McNulty B, Rosen GL, McCoy R, Brundu F, Sayyari E, Scheffler K, Truong S, Catreux S, Hannah LC, Lipson D, Benjamin H, Iremadze N, Soifer I, Krieger G, Eacker S, Wood M, Cross E, Husar G, Gross S, Vernich M, Kolmogorov M, Ahmad T, Keskus AG, Bryant A, Thibaud-Nissen F, Trow J, Proszynski J, Hirschberg JW, Ryon K, Mason CE, Bhakta MS, Sanborn JZ, Munding EM, Wagner J, Xiao C, Liss AS, Zook JM. Sci Data. 2025 Jul 16;12(1):1195. doi: 10.1038/s41597-025-05438-2. PMID: 40670386 Free PMC article.
