This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2025 Jan 7:2024.12.19.629494.

doi: 10.1101/2024.12.19.629494.

HapCNV: A Comprehensive Framework for CNV Detection in Low-input DNA Sequencing Data

Xuanxuan Yu¹, Fei Qin², Shiwei Liu³, Noah J Brown⁴, Qing Lu⁵, Guoshuai Cai⁶, Jennifer L Guler⁴, Feifei Xiao⁵

Affiliations

¹ Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC, USA.
² Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Rockville, MD, 20850, USA.
³ Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana, USA.
⁴ Department of Biology, University of Virginia, Charlottesville, VA, USA.
⁵ Department of Biostatistics, College of Public Health and Health Promotions & College of Medicine, University of Florida, Gainesville, FL, USA.
⁶ Department of Surgery, College of Medicine, University of Florida, Gainesville, FL, USA.

PMID: 39763944
PMCID: PMC11702719
DOI: 10.1101/2024.12.19.629494

Xuanxuan Yu et al. bioRxiv. 2025.

Abstract

Copy number variants (CNVs) are prevalent in both diploid and haploid genomes, with the latter containing a single copy of each gene. Studying CNVs in genomes from single or few cells is significantly advancing our knowledge of human disorders and disease susceptibility. Low-input sequencing data, including low-cell and single-cell data, for haploid and diploid organisms generally display shallow and highly non-uniform read counts, because the whole genome amplification steps introduce amplification biases. In addition, haploid organisms typically possess relatively short genomes and require a higher degree of DNA amplification than diploid organisms. However, most CNV detection methods were developed for diploid genomes and do not specifically account for these effects in haploid genomes. A further challenge lies in the reference samples, or normal controls, that provide the baseline signal against which copy number losses or gains are defined. In traditional methods, references are usually pre-specified from cells that are assumed to be normal or disease-free. However, the use of pre-defined reference cells can bias results if common CNVs are present. Here, we present HapCNV, a comprehensive statistical framework for data normalization and CNV detection in haploid single- or low-cell DNA sequencing data. Its main advance is the construction of a novel genomic-location-specific pseudo-reference that selects unbiased references using a preliminary cell clustering method. This approach effectively preserves common CNVs. Using simulations, we demonstrated that HapCNV outperformed existing methods by detecting CNVs more accurately, especially short CNVs. The superior performance of HapCNV was also validated by detecting known CNVs in a real P. falciparum parasite dataset.
In conclusion, HapCNV provides a novel and useful approach for CNV detection in haploid low-input sequencing datasets, with easy applicability to diploids.

Keywords: Copy number variation; Haploid; Low-input sequencing; Pseudo-reference sequence; Single-cell DNA sequencing.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

**Fig. 1. Flowchart of HapCNV.**
A: Low-cell or single-cell DNA-seq raw data matrix, with rows representing genomic markers or bins and columns denoting cells. B: Read count matrix after quality control and correction of biases introduced by GC content and mappability. C: Pseudo-reference construction: the left panel shows the results of cell clustering, where three clusters are identified with shared CNVs detected within clusters; the right panel shows the reference sequence constructed from the cells marked with red stars. For example, when constructing the reference sequence for the first cell, the other cells identified as being in the normal state at a given bin are used as the references for that bin. Specifically, for the first bin, the two star-labeled normal cells serve as the reference, and their median is used to normalize the read count of the first cell. We considered three copy number states: 2 (duplication), 1 (normal), and 0 (deletion). D: The normalized matrix is calculated by dividing the read counts of each tested cell by the median of the read counts from the pseudo-reference sequences, and is then log-transformed. An additional smoothing step removes outliers from the intensity data. E: Segmentation and CNV clustering are conducted using the Circular Binary Segmentation (CBS) method and a Gaussian Mixture Model (GMM), respectively.
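
The pseudo-reference normalization in panels C and D can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pseudo_reference_log_ratio`, the list-of-lists layout, the use of base-2 logarithms, and the handling of bins with no usable reference are all assumptions made for illustration. The preliminary copy number states are taken as given here, whereas HapCNV derives them from cell clustering.

```python
import math
from statistics import median


def pseudo_reference_log_ratio(counts, prelim_states, normal_state=1):
    """Normalize each cell against a bin-specific pseudo-reference.

    counts[b][c]        bias-corrected read count for bin b, cell c
    prelim_states[b][c] preliminary state (0 deletion, 1 normal, 2 duplication)

    Returns log2(count / reference median) per bin and cell, or None where
    no reference is available or a count is zero (an assumed convention).
    """
    n_bins, n_cells = len(counts), len(counts[0])
    log_ratio = [[None] * n_cells for _ in range(n_bins)]
    for b in range(n_bins):
        for c in range(n_cells):
            # Reference for (bin b, cell c): all OTHER cells that are
            # preliminarily in the normal state at this bin.
            refs = [counts[b][k] for k in range(n_cells)
                    if k != c and prelim_states[b][k] == normal_state]
            if not refs:
                continue
            ref = median(refs)
            if ref > 0 and counts[b][c] > 0:
                log_ratio[b][c] = math.log2(counts[b][c] / ref)
    return log_ratio


# Toy data: 2 bins x 4 cells. Cell 2 carries a duplication in bin 0,
# cell 0 carries a deletion in bin 1.
counts = [[100, 100, 200, 100],
          [50, 100, 100, 100]]
states = [[1, 1, 2, 1],
          [0, 1, 1, 1]]
ratios = pseudo_reference_log_ratio(counts, states)
print(ratios[0][2], ratios[1][0])
```

Because the reference is chosen per bin, a CNV shared by many cells does not contaminate the baseline at that bin, which is the property the caption attributes to the pseudo-reference.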

**Fig. 2. Performance evaluation of HapCNV using F1 score in comparison with existing methods for simulated data without amplification biases.**
For the simulations, we randomly shuffled the locations of the genomic biomarkers prior to spiking in the CNV signals, to remove any effect of amplification biases on the evaluation. Signals for 118 cells were simulated for three different CNV states: deletion of a single copy (DEL), duplication of a single copy (DUP), and a mixture of the two CNV states (MIXED), with varied lengths. The CNV proportion ranges from 5% to 90%.
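
The shuffle-then-spike-in design described in this caption can be sketched as follows. Everything here is an illustrative assumption rather than the authors' simulation code: the function name `spike_in_cnv`, the additive `shift` on a log-ratio scale, the use of one shared permutation for all cells, and the reading of "CNV proportion" as the fraction of cells that carry the CNV.

```python
import random


def spike_in_cnv(profiles, start, length, shift, cnv_proportion,
                 shuffle_bins=True, seed=0):
    """Spike a CNV signal into baseline profiles.

    profiles[c][b] is the baseline signal for cell c at bin b.
    Returns (simulated profiles, boolean truth labels per cell and bin).
    """
    rng = random.Random(seed)
    n_cells, n_bins = len(profiles), len(profiles[0])

    # Optionally permute bin locations (same permutation for every cell),
    # which breaks up location-dependent bias before the signal is added.
    order = list(range(n_bins))
    if shuffle_bins:
        rng.shuffle(order)
    sim = [[cell[b] for b in order] for cell in profiles]

    # Pick the carrier cells (assumed meaning of "CNV proportion").
    n_carriers = round(cnv_proportion * n_cells)
    carriers = set(rng.sample(range(n_cells), n_carriers))

    truth = [[False] * n_bins for _ in range(n_cells)]
    for c in carriers:
        for b in range(start, min(start + length, n_bins)):
            sim[c][b] += shift  # positive shift: gain, negative shift: loss
            truth[c][b] = True
    return sim, truth


# Example: 10 flat cells, a 4-bin gain in half of them.
baseline = [[0.0] * 20 for _ in range(10)]
sim, truth = spike_in_cnv(baseline, start=5, length=4, shift=1.0,
                          cnv_proportion=0.5)
print(sum(any(row) for row in truth))
```

Calling the same hypothetical function with `shuffle_bins=False` leaves the bins at their original locations, so any location-dependent bias in the baseline is retained.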

**Fig. 3. Performance evaluation of HapCNV using F1 score in comparison with existing methods for simulated data with amplification biases.**
For the simulations, we retained the original locations of the genomic biomarkers prior to spiking in the CNV signals. Signals for 118 cells were simulated for three different CNV states: deletion of a single copy (DEL), duplication of a single copy (DUP), and a mixture of the two CNV states (MIXED), with varied lengths. The CNV proportion ranges from 5% to 90%.
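
For readers unfamiliar with the metric in Figs. 2 and 3, the F1 score is the harmonic mean of precision and recall. The sketch below computes it at the level of individual bins, which is an assumption: the captions do not state whether the authors score bins, segments, or whole CNVs. The function name `f1_score` and the boolean-matrix inputs are likewise illustrative.

```python
def f1_score(truth, called):
    """Bin-level F1 for CNV calls.

    truth[c][b] and called[c][b] are booleans marking whether bin b of
    cell c truly lies in a CNV and whether it was called as one.
    """
    tp = fp = fn = 0
    for truth_row, called_row in zip(truth, called):
        for is_cnv, is_called in zip(truth_row, called_row):
            if is_cnv and is_called:
                tp += 1          # correctly called CNV bin
            elif is_called:
                fp += 1          # called, but not a true CNV bin
            elif is_cnv:
                fn += 1          # true CNV bin that was missed
    if tp == 0:
        return 0.0               # avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# One cell, four bins: one hit, one false call, one miss.
print(f1_score([[True, True, False, False]],
               [[True, False, True, False]]))
```

A bin-level score weights long CNVs more heavily than short ones, so a segment-level variant may be preferable when short CNVs are the focus.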
