[Preprint]. 2025 Jan 7:2024.12.19.629494.
doi: 10.1101/2024.12.19.629494.

HapCNV: A Comprehensive Framework for CNV Detection in Low-input DNA Sequencing Data


Xuanxuan Yu et al. bioRxiv.

Abstract

Copy number variants (CNVs) are prevalent in both diploid and haploid genomes, with the latter containing a single copy of each gene. Studying CNVs in genomes from single or few cells is significantly advancing our knowledge of human disorders and disease susceptibility. Low-input sequencing data, including low-cell and single-cell data, for haploid and diploid organisms generally display shallow and highly non-uniform read counts because the whole genome amplification steps introduce amplification biases. In addition, haploid organisms typically possess relatively short genomes and require a higher degree of DNA amplification than diploid organisms. However, most CNV detection methods are developed specifically for diploid genomes, without consideration of their effects on haploid genomes. Challenges also arise from the reference samples or normal controls that provide baseline signals for defining copy number losses or gains. In traditional methods, references are usually pre-specified from cells that are assumed to be normal or disease-free. However, the use of pre-defined reference cells can bias results if common CNVs are present. Here, we present HapCNV, a comprehensive statistical framework for data normalization and CNV detection in haploid single- or low-cell DNA sequencing data. Its main advance is the construction of a novel genomic-location-specific pseudo-reference that selects unbiased references using a preliminary cell clustering method. This approach effectively preserves common CNVs. Using simulations, we demonstrated that HapCNV outperformed existing methods in CNV detection accuracy, especially for short CNVs. The superior performance of HapCNV was also validated by detecting known CNVs in a real P. falciparum parasite dataset. In conclusion, HapCNV provides a novel and useful approach for CNV detection in haploid low-input sequencing datasets, with easy applicability to diploids.

Keywords: Copy number variation; Haploid; Low-input sequencing; Pseudo-reference sequence; Single-cell DNA sequencing.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Fig. 1. Flowchart of HapCNV.
A: Low-cell or single-cell DNA-seq raw data matrix, with rows representing genomic markers or bins and columns denoting cells. B: Read count matrix after quality control and correction of biases introduced by GC content and mappability. C: Pseudo-reference construction: the left panel shows the results of cell clustering, where three clusters are identified with shared CNVs detected within clusters; the right panel shows the reference sequence constructed from the cells marked with red stars. The reference for each cell is built bin by bin from the other cells identified as being in the normal state. For example, for the first bin of the first cell, the two star-labeled normal cells serve as the reference, and their median is used to normalize the read count of the first cell. We considered three copy number states: 2, duplication; 1, normal; 0, deletion. D: The normalized matrix is calculated by dividing the read counts of each tested cell by the median of the read counts from the pseudo-reference sequences, and a logarithm transformation is then applied. An additional smoothing step removes outliers in the intensity data. E: Segmentation and CNV clustering are conducted using the Circular Binary Segmentation (CBS) method and a Gaussian Mixture Model (GMM), respectively.
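The normalization in panels C and D can be illustrated with a minimal sketch. It is not the HapCNV implementation: the function name `pseudo_reference_log_ratio`, the use of a base-2 logarithm, and the handling of bins with no normal reference cell (set to NaN) are assumptions made for illustration. The preliminary per-bin states are taken as given, since the clustering step that produces them is not specified here.

```python
import warnings
import numpy as np

def pseudo_reference_log_ratio(counts, prelim_states, normal_state=1):
    """Hypothetical helper: per-bin pseudo-reference normalization.

    counts        : bins x cells array of bias-corrected read counts
    prelim_states : bins x cells array of preliminary copy number states
                    (0 = deletion, 1 = normal, 2 = duplication)
    Returns a bins x cells array of log2 ratios (log base is an assumption).
    """
    counts = np.asarray(counts, dtype=float)
    is_normal = np.asarray(prelim_states) == normal_state
    n_bins, n_cells = counts.shape
    log_ratio = np.full((n_bins, n_cells), np.nan)

    for j in range(n_cells):
        # Reference cells for cell j: all OTHER cells that are normal in a bin.
        ref_mask = is_normal.copy()
        ref_mask[:, j] = False
        ref_counts = np.where(ref_mask, counts, np.nan)

        # Per-bin median over the reference cells; bins without any
        # reference cell yield NaN (the warning for that case is silenced).
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", RuntimeWarning)
            ref_median = np.nanmedian(ref_counts, axis=1)

        # Divide the tested cell by the reference median, then log-transform.
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio[:, j] = np.log2(counts[:, j] / ref_median)

    # Zero counts or zero medians give +/-inf; treat them as missing.
    log_ratio[~np.isfinite(log_ratio)] = np.nan
    return log_ratio

# Toy example: 2 bins x 3 cells; the third cell carries a duplication in bin 0.
toy_counts = [[10, 10, 20], [10, 10, 10]]
toy_states = [[1, 1, 2], [1, 1, 1]]
toy_log_ratio = pseudo_reference_log_ratio(toy_counts, toy_states)
```

Because the reference is chosen separately for every bin, a cell that carries a CNV in one region can still serve as a reference elsewhere, which is how a shared CNV avoids being absorbed into the baseline.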
Fig. 2. Performance evaluation of HapCNV using F1 score in comparison with existing methods for simulated data without amplification biases.
For the simulations, we randomly shuffled the locations of the genomic biomarkers prior to spiking in the CNV signals, to remove any effect of amplification biases on the evaluation. Signals for 118 cells were simulated for three different CNV states: deletion of a single copy (DEL), duplication of a single copy (DUP), and a mixture of the two CNV states (MIXED), with varied lengths. The CNV proportion ranges from 5% to 90%.
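A spike-in of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the simulation code used for the figures: the function `spike_in_cnv` and its parameters are hypothetical, the CNV proportion is interpreted as the fraction of cells carrying the CNV, and the signal is introduced by scaling counts by the copy number state relative to a haploid baseline of 1, with no added noise.

```python
import numpy as np

def spike_in_cnv(counts, start, length, state, cell_fraction,
                 shuffle_bins=True, seed=0):
    """Hypothetical helper: spike one CNV segment into a bins x cells matrix.

    state         : 0 (deletion) or 2 (duplication); baseline copy number is 1
    cell_fraction : fraction of cells that carry the CNV (assumed meaning of
                    "CNV proportion")
    shuffle_bins  : permute bin order first, which breaks the spatial
                    structure of amplification biases along the genome
    Returns the simulated matrix and the matching ground-truth state matrix.
    """
    rng = np.random.default_rng(seed)
    sim = np.asarray(counts, dtype=float).copy()
    n_bins, n_cells = sim.shape

    if shuffle_bins:
        sim = sim[rng.permutation(n_bins), :]

    # Choose the carrier cells at random, without replacement.
    n_carriers = int(round(cell_fraction * n_cells))
    carriers = rng.choice(n_cells, size=n_carriers, replace=False)

    # Ground truth: every bin is normal (1) except the spiked segment.
    truth = np.ones((n_bins, n_cells), dtype=int)
    truth[start:start + length, carriers] = state

    # Scale read counts in the segment by the copy number state.
    sim[start:start + length, carriers] *= state
    return sim, truth

# Toy example: 20 bins x 10 cells, a 4-bin duplication in 30% of cells.
baseline = np.full((20, 10), 100.0)
sim_counts, sim_truth = spike_in_cnv(baseline, start=5, length=4,
                                     state=2, cell_fraction=0.3)
```

Setting `shuffle_bins=False` keeps the original bin order, which corresponds to a scenario in which amplification biases are left in place.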
Fig. 3. Performance evaluation of HapCNV using F1 score in comparison with existing methods for simulated data with amplification biases.
For the simulations, we retained the original locations of the genomic biomarkers when spiking in the CNV signals, so that amplification biases were preserved. Signals for 118 cells were simulated for three different CNV states: deletion of a single copy (DEL), duplication of a single copy (DUP), and a mixture of the two CNV states (MIXED), with varied lengths. The CNV proportion ranges from 5% to 90%.
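For reference, the F1 score used in Figs. 2 and 3 is the harmonic mean of precision and recall. The sketch below computes it per bin and per cell for one copy number state. The exact counting unit used in the evaluation is not stated here, so the bin-level convention and the function name `f1_for_state` are assumptions.

```python
import numpy as np

def f1_for_state(truth, pred, state):
    """Hypothetical helper: bin-level F1 score for one copy number state.

    truth, pred : arrays of the same shape holding copy number states
                  (0 = deletion, 1 = normal, 2 = duplication)
    state       : the CNV state treated as the positive class
    """
    truth = np.asarray(truth)
    pred = np.asarray(pred)

    tp = np.sum((pred == state) & (truth == state))  # correctly called bins
    fp = np.sum((pred == state) & (truth != state))  # spurious calls
    fn = np.sum((pred != state) & (truth == state))  # missed bins

    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

# Toy example: one duplicated bin is found, one is missed, one call is spurious.
toy_truth = [1, 1, 2, 2, 0, 1]
toy_pred = [1, 2, 2, 1, 0, 1]
dup_f1 = f1_for_state(toy_truth, toy_pred, state=2)
del_f1 = f1_for_state(toy_truth, toy_pred, state=0)
```

Scoring each state separately keeps a deletion that is miscalled as a duplication from counting as a correct detection.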
