[Preprint]. 2025 Jul 26:2024.11.22.624040.
doi: 10.1101/2024.11.22.624040.

CNV-Finder: Streamlining Copy Number Variation Discovery


Nicole Kuznetsov et al. bioRxiv.

Abstract

Copy Number Variations (CNVs) play pivotal roles in the etiology of complex diseases and vary across diverse populations. Understanding the association between CNVs and disease susceptibility is important in disease genetics research and often requires analysis of large sample sizes. One of the most cost-effective and scalable methods for detecting CNVs is based on normalized signal intensity values, such as Log R Ratio (LRR) and B Allele Frequency (BAF), from Illumina genotyping arrays. In this study, we present CNV-Finder, a novel pipeline that applies deep learning, specifically a Long Short-Term Memory (LSTM) network, to array data to expedite the large-scale identification of CNVs within predefined genomic regions. This facilitates efficient prioritization of samples for time-consuming or costly subsequent analyses such as Multiplex Ligation-dependent Probe Amplification (MLPA) and short-read or long-read whole genome sequencing. We establish our methods on four genes, Parkin (PRKN), Leucine Rich Repeat And Ig Domain Containing 2 (LINGO2), Microtubule Associated Protein Tau (MAPT), and alpha-Synuclein (SNCA), which may be relevant to neurological diseases such as Alzheimer's disease (AD), Parkinson's disease (PD), Progressive Supranuclear Palsy (PSP), or related disorders such as essential tremor (ET). By training our models on expert-annotated samples and validating them across diverse cohorts, including those from the Global Parkinson's Genetics Program (GP2) and additional dementia-specific databases, we demonstrate the efficacy of CNV-Finder in accurately detecting deletions and duplications. Our pipeline outputs app-compatible files for visualization within CNV-Finder's interactive web application. This interface enables researchers to review predictions and filter displayed samples by model prediction values, LRR range, and variant count in order to explore or confirm results.
Our pipeline integrates this human feedback to enhance model performance and reduce false positive rates. Through a series of comprehensive analyses and validations using visual inspection, MLPA, short-read, and long-read sequencing data, we demonstrate the robustness and adaptability of CNV-Finder in identifying CNVs in regions of varied size, probe density, and noise. Our findings highlight the significance of contextual understanding and human expertise in enhancing the precision of CNV identification, particularly in complex genomic regions like 17q21.31. The CNV-Finder pipeline is a scalable, publicly available resource for the scientific community, hosted on GitHub (https://github.com/GP2code/CNV-Finder; DOI 10.5281/zenodo.14182563). CNV-Finder not only expedites accurate candidate identification but also significantly reduces the manual workload for researchers, enabling future targeted validation and downstream analyses in regions or phenotypes of interest.
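The review step described in the abstract (filtering displayed samples by model prediction value, LRR range, and variant count) can be illustrated with a minimal Python sketch. This is not CNV-Finder's actual code or file format: the `SamplePrediction` record, its field names, the `filter_predictions` helper, and all default cutoffs are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class SamplePrediction:
    """Hypothetical per-sample record; field names are assumptions."""
    sample_id: str
    prediction: float   # model output, assumed to lie in [0, 1]
    mean_lrr: float     # mean Log R Ratio over the candidate variants
    variant_count: int  # number of candidate variants in the region


def filter_predictions(
    rows: Iterable[SamplePrediction],
    min_prediction: float = 0.8,                   # illustrative cutoff
    lrr_range: Tuple[float, float] = (-2.0, 2.0),  # illustrative range
    min_variants: int = 10,                        # illustrative cutoff
) -> List[SamplePrediction]:
    """Keep samples passing all three filters, highest prediction first."""
    low, high = lrr_range
    kept = [
        r for r in rows
        if r.prediction >= min_prediction
        and low <= r.mean_lrr <= high
        and r.variant_count >= min_variants
    ]
    return sorted(kept, key=lambda r: r.prediction, reverse=True)


if __name__ == "__main__":
    demo = [
        SamplePrediction("S1", 0.93, -0.45, 40),
        SamplePrediction("S2", 0.20, -0.05, 55),
        SamplePrediction("S3", 1.00, -0.60, 8),
        SamplePrediction("S4", 1.00, -0.52, 31),
    ]
    for row in filter_predictions(demo):
        print(row.sample_id, row.prediction)
```

Sorting by prediction value puts the strongest candidates first, which matches the stated goal of prioritizing samples for costlier follow-up such as MLPA or sequencing.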

Keywords: Copy Number Variation (CNV); Python; Structural Variant (SV); deep learning; genetics; long short-term memory (LSTM); pipeline.


Figures

Figure 1. Workflow to demonstrate the iterative growth of our training sets through testing and prediction reviews on additional cohorts.
Figure 2. Feature creation to capture deletions (A) and duplications (B). (1) Illumina-based thresholds are applied to SNPs to determine if their LRR and BAF values fall into the proper ranges to be considered deletions and duplications. (2) Features in Supplementary Table 8 are calculated on the variants within these specified CNV ranges (CNV candidates).
Figure 3. MLPA-validated SNCA multiplications from two monogenic families with their respective prediction values determined by CNV-Finder’s duplication model.
Figure 4. Long-read WGS validation of predicted deletions in PRKN. Four samples (A, B, C, E) received prediction values of 1 and one sample (D) received 0.93.
Figure 5. Long-read WGS validation of predicted duplications near MAPT. All three samples received prediction values of 1.
Figure 6. Short-read WGS validation of model results with varied prediction values.
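The two-step idea in the Figure 2 caption (threshold SNPs on LRR and BAF, then compute features over the resulting CNV candidates) can be sketched in Python as follows. The numeric thresholds below are illustrative placeholders, not the Illumina-based values used by CNV-Finder, and the three summary features are stand-ins for the real feature set in Supplementary Table 8, which is not reproduced here. The names `Snp`, `is_deletion_candidate`, `is_duplication_candidate`, and `summarize` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Snp:
    position: int
    lrr: float  # Log R Ratio
    baf: float  # B Allele Frequency, between 0 and 1


# Placeholder thresholds (assumptions, not the pipeline's values).
DEL_LRR_MAX = -0.2                           # deletions lower the LRR
DUP_LRR_MIN = 0.2                            # duplications raise the LRR
HOM_BAF_LOW, HOM_BAF_HIGH = 0.15, 0.85       # deletion: no heterozygous band
DUP_BAF_BANDS = ((0.25, 0.4), (0.6, 0.75))   # duplication: bands near 1/3, 2/3


def is_deletion_candidate(snp: Snp) -> bool:
    """Low LRR with BAF near 0 or 1 (heterozygous band lost)."""
    return snp.lrr < DEL_LRR_MAX and (
        snp.baf < HOM_BAF_LOW or snp.baf > HOM_BAF_HIGH
    )


def is_duplication_candidate(snp: Snp) -> bool:
    """High LRR with BAF in a band near 1/3 or 2/3.

    Simplification: homozygous SNPs inside a duplication are not flagged.
    """
    return snp.lrr > DUP_LRR_MIN and any(
        low <= snp.baf <= high for low, high in DUP_BAF_BANDS
    )


def summarize(snps: List[Snp], is_candidate: Callable[[Snp], bool]) -> Dict[str, float]:
    """Compute a few illustrative features over the flagged candidates."""
    hits = [s for s in snps if is_candidate(s)]
    n = len(hits)
    return {
        "candidate_count": n,
        "candidate_fraction": n / len(snps) if snps else 0.0,
        "mean_candidate_lrr": sum(s.lrr for s in hits) / n if n else 0.0,
    }


if __name__ == "__main__":
    region = [Snp(1, -0.5, 0.02), Snp(2, 0.0, 0.5),
              Snp(3, 0.4, 0.33), Snp(4, -0.6, 0.98)]
    print(summarize(region, is_deletion_candidate))
    print(summarize(region, is_duplication_candidate))
```

Keeping the threshold test separate from the feature summary mirrors the caption's split into steps (1) and (2), so either part could be swapped out without touching the other.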

