This is a preprint.

It has not yet been peer reviewed by a journal.


[Preprint]. 2025 Jul 26:2024.11.22.624040.

doi: 10.1101/2024.11.22.624040.

CNV-Finder: Streamlining Copy Number Variation Discovery

Nicole Kuznetsov^{1,2}, Kensuke Daida¹, Mary B Makarious^{1,2,3}, Bashayer Al-Mubarak⁴, Kajsa Atterling Brolin^{5,6}, Laksh Malik¹, Cedric Kouam¹, Breeana Baker¹, Raquel Real³, Kathryn Step^{7,8}, Lara M Lange^{9,10}, Lesley Wu³, Miriam Ostrozovicova^{11,12,13}, Katherine M Andersh¹, Pin-Jui Kung¹⁴, Yasser Mecheri¹⁵, Yi-Wen Tay¹⁶, Behloul Soundous Malek¹⁵, Nada Al Tassan⁴, Maria Teresa Periñan^{6,17}, Samantha Hong¹, Mathew J Koretsky^{1,2}, Lana Sargeant^{1,9,18}, Kristin Levine^{1,2}, Cornelis Blauwendraat^{1,9}, Kimberley J Billingsley¹, Sara Bandres-Ciga¹, Hampton L Leonard^{1,2,19,20}, Soraya Bardien^{7,8,21}, Huw R Morris³, Andrew B Singleton^{1,9}, Mike A Nalls^{1,2}, Dan Vitale^{1,2}; Global Parkinson’s Genetics Program (GP2)

Affiliations

¹ Center for Alzheimer's and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA.
² DataTecnica LLC, Washington, DC 20037, USA.
³ Department of Clinical and Movement Neurosciences, Queen Square Institute of Neurology, University College London, London, UK.
⁴ King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia.
⁵ Translational Neurogenetics Unit, Department of Experimental Medical Science, Lund University, Lund, Sweden.
⁶ Centre for Preventive Neurology, Wolfson Institute of Population Health, Queen Mary University of London, London, UK.
⁷ Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa.
⁸ South African Medical Research Council Centre for Tuberculosis Research, Stellenbosch University, Cape Town, South Africa.
⁹ Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD 20892, USA.
¹⁰ Institute of Neurogenetics, University of Luebeck, Luebeck, Germany.
¹¹ Department of Neurology, P.J. Safarik University, Kosice, Slovak Republic.
¹² Department of Neurology, University Hospital of L. Pasteur, Kosice, Slovak Republic.
¹³ Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London, UK.
¹⁴ Genome and Systems Biology Degree Program, National Taiwan University and Academia Sinica, Taipei, Taiwan; Division of Plastic Surgery, Department of Surgery, National Taiwan University Hospital, Taiwan.
¹⁵ Neurology Department, Dr Benbadis University Hospital, Constantine, Algeria.
¹⁶ University of Malaya, Kuala Lumpur, Malaysia.
¹⁷ Unidad de Trastornos del Movimiento, Servicio de Neurología y Neurofisiología Clínica, Instituto de Biomedicina de Sevilla, Hospital Universitario Virgen del Rocío/CSIC/Universidad de Sevilla, Seville, Spain.
¹⁸ School of Nursing, Virginia Commonwealth University, Richmond, VA, USA.
¹⁹ German Center for Neurodegenerative Diseases (DZNE), Tübingen, Germany.
²⁰ Centre for Genetic Epidemiology, Institute for Clinical Epidemiology and Applied Biometry, University of Tübingen, Tübingen, Germany.
²¹ South African Medical Research Council/Stellenbosch University Genomics of Brain Disorders Research Unit, Stellenbosch University, Cape Town, South Africa.

PMID: 39605431
PMCID: PMC11601614
DOI: 10.1101/2024.11.22.624040

Nicole Kuznetsov et al. bioRxiv. 2025.

Abstract

Copy Number Variations (CNVs) play pivotal roles in the etiology of complex diseases and vary across diverse populations. Understanding the association between CNVs and disease susceptibility is important in disease genetics research and often requires analysis of large sample sizes. One of the most cost-effective and scalable methods for detecting CNVs is based on normalized signal intensity values, such as Log R Ratio (LRR) and B Allele Frequency (BAF), from Illumina genotyping arrays. In this study, we present CNV-Finder, a novel pipeline that applies deep learning to array data, specifically a Long Short-Term Memory (LSTM) network, to expedite the large-scale identification of CNVs within predefined genomic regions. This facilitates efficient prioritization of samples for time-consuming or costly subsequent analyses such as Multiplex Ligation-dependent Probe Amplification (MLPA) and short-read and long-read whole genome sequencing. We use four genes to establish our methods: Parkin (PRKN), Leucine Rich Repeat And Ig Domain Containing 2 (LINGO2), Microtubule Associated Protein Tau (MAPT), and alpha-Synuclein (SNCA). These genes may be relevant to neurological diseases such as Alzheimer's disease (AD), Parkinson's disease (PD), Progressive Supranuclear Palsy (PSP), or related disorders such as essential tremor (ET). By training our models on expert-annotated samples and validating them across diverse cohorts, including those from the Global Parkinson's Genetics Program (GP2) and additional dementia-specific databases, we demonstrate the efficacy of CNV-Finder in accurately detecting deletions and duplications. Our pipeline outputs app-compatible files for visualization within CNV-Finder's interactive web application. This interface enables researchers to review predictions and filter displayed samples by model prediction values, LRR range, and variant count in order to explore or confirm results. Our pipeline integrates this human feedback to enhance model performance and reduce false positive rates. Through a series of comprehensive analyses and validations using visual inspection, MLPA, short-read, and long-read sequencing data, we demonstrate the robustness and adaptability of CNV-Finder in identifying CNVs in regions of varied size, probe density, and noise. Our findings highlight the significance of contextual understanding and human expertise in enhancing the precision of CNV identification, particularly in complex genomic regions like 17q21.31. The CNV-Finder pipeline is a scalable, publicly available resource for the scientific community, hosted on GitHub (https://github.com/GP2code/CNV-Finder; DOI 10.5281/zenodo.14182563). CNV-Finder not only expedites accurate candidate identification but also significantly reduces the manual workload for researchers, enabling future targeted validation and downstream analyses in regions or phenotypes of interest.
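
The review step described above, filtering displayed samples by model prediction value, LRR range, and variant count, can be illustrated with a minimal Python sketch. This is not code from the CNV-Finder repository: the record fields (`prediction`, `mean_lrr`, `variant_count`), the function name, and the default cutoffs are all hypothetical, and "LRR range" is interpreted here as a bound on a per-sample mean LRR, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SamplePrediction:
    """One reviewed sample; every field name here is hypothetical."""
    sample_id: str
    prediction: float    # model output, assumed to lie in [0, 1]
    mean_lrr: float      # assumed per-sample mean Log R Ratio in the region
    variant_count: int   # assumed number of CNV-candidate variants


def filter_predictions(rows, min_prediction=0.8,
                       lrr_range=(-2.0, 2.0), min_variants=1):
    """Keep samples that pass all three filters (illustrative defaults)."""
    low, high = lrr_range
    return [
        r for r in rows
        if r.prediction >= min_prediction
        and low <= r.mean_lrr <= high
        and r.variant_count >= min_variants
    ]


rows = [
    SamplePrediction("S1", 0.97, -0.45, 12),
    SamplePrediction("S2", 0.40, -0.05, 3),   # fails the prediction filter
    SamplePrediction("S3", 0.91, -0.30, 0),   # fails the variant-count filter
]
kept = filter_predictions(rows)
print([r.sample_id for r in kept])
```

Samples that survive such filters would be the ones prioritized for manual review or for follow-up validation such as MLPA or sequencing.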

Keywords: Copy Number Variation (CNV); Python; Structural Variant (SV); deep learning; genetics; long short-term memory (LSTM); pipeline.

Figures

**Figure 1.**
Workflow to demonstrate the iterative growth of our training sets through testing and prediction reviews on additional cohorts.

**Figure 2.**
Feature creation to capture deletions (A) and duplications (B). (1) Illumina-based thresholds are applied to SNPs to determine if their LRR and BAF values fall into the proper ranges to be considered deletions and duplications. (2) Features in Supplementary Table 8 are calculated on the variants within these specified CNV ranges (CNV candidates).
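
The two steps in this caption can be sketched in Python as follows. The cutoffs below are illustrative placeholders, not the Illumina-based thresholds used by CNV-Finder, and the summary features are hypothetical stand-ins for those listed in Supplementary Table 8. The sketch assumes only the general pattern that deletions show reduced LRR with BAF away from the heterozygous cluster, while duplications show raised LRR with BAF near 1/3 or 2/3.

```python
from statistics import mean

# Placeholder cutoffs for illustration only (assumptions, not the paper's values).
DEL_LRR_MAX = -0.2
DUP_LRR_MIN = 0.2
DUP_BAF_BANDS = ((0.25, 0.40), (0.60, 0.75))


def is_deletion_candidate(lrr, baf):
    """Reduced intensity and no heterozygous BAF cluster."""
    return lrr < DEL_LRR_MAX and (baf < 0.15 or baf > 0.85)


def is_duplication_candidate(lrr, baf):
    """Raised intensity and BAF inside one of the shifted heterozygous bands."""
    return lrr > DUP_LRR_MIN and any(lo <= baf <= hi for lo, hi in DUP_BAF_BANDS)


def summarize(snps, is_candidate):
    """Compute simple per-sample features from (position, lrr, baf) tuples."""
    hits = [(pos, lrr, baf) for pos, lrr, baf in snps if is_candidate(lrr, baf)]
    return {
        "n_snps": len(snps),
        "n_candidates": len(hits),
        "candidate_fraction": len(hits) / len(snps) if snps else 0.0,
        "mean_candidate_lrr": mean(lrr for _, lrr, _ in hits) if hits else None,
    }


snps = [(1, -0.5, 0.02), (2, -0.4, 0.98), (3, 0.01, 0.50), (4, 0.3, 0.33)]
deletion_features = summarize(snps, is_deletion_candidate)
duplication_features = summarize(snps, is_duplication_candidate)
```

Passing the candidate test as a function keeps the feature computation identical for the deletion and duplication cases, so only the thresholds differ between the two.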

**Figure 3.**
MLPA-validated *SNCA* multiplications from two monogenic families with their respective prediction values determined by CNV-Finder’s duplication model.

**Figure 4.**
Long-read WGS validation of predicted deletions in *PRKN*. Four samples (A, B, C, E) received prediction values of 1 and one sample (D) received 0.93.

**Figure 5.**
Long-read WGS validation of predicted duplications near *MAPT*. All three samples received prediction values of 1.

**Figure 6.**
Short-read WGS validation of model results with varied prediction values.
