BMC Genomics. 2021 Nov 17;22(1):826.

doi: 10.1186/s12864-021-08082-3.

Comprehensive characterization of copy number variation (CNV) called from array, long- and short-read data

Ksenia Lavrichenko¹,², Stefan Johansson³,⁴, Inge Jonassen⁵

Affiliations

¹ Computational Biology Unit, University of Bergen, Bergen, Norway. ksenia.lavrichenko@medisin.uio.no.
² Department of Clinical Science, University of Bergen, Bergen, Norway. ksenia.lavrichenko@medisin.uio.no.
³ Department of Clinical Science, University of Bergen, Bergen, Norway.
⁴ Department of Medical Genetics, Haukeland University Hospital, Bergen, Norway.
⁵ Computational Biology Unit, University of Bergen, Bergen, Norway.

PMID: 34789167
PMCID: PMC8596897
DOI: 10.1186/s12864-021-08082-3

Ksenia Lavrichenko et al. BMC Genomics. 2021.

Abstract

Background: SNP arrays, short- and long-read genome sequencing are genome-wide high-throughput technologies that may be used to assay copy number variants (CNVs) in a personal genome. Each of these technologies comes with its own limitations and biases, many of which are well-known, but not all of them are thoroughly quantified.

Results: We assembled an ensemble of public datasets of published CNV calls and raw data for the well-studied Genome in a Bottle individual NA12878. This collection represents a variety of methods and pipelines used for CNV calling from array, short- and long-read technologies. We then compared the technologies with regard to their ability to call CNVs. Unlike other studies, we refrained from using a gold standard. Instead, we attempted to validate the CNV calls against the raw data of each technology.

Conclusions: Our study confirms that long-read platforms enable recalling CNVs in genomic regions inaccessible to arrays or short reads. We also found that the reproducibility of a CNV by different pipelines within each technology is strongly linked to other CNV evidence measures. Importantly, the three technologies show distinct public database frequency profiles, which differ depending on what technology the database was built on.

Keywords: CNV; Genome in a Bottle; Long reads; Microarrays; Short reads.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Defining and describing the CNVRs in terms of within-technology support, read depth fold change evidence, size and long-read intrinsic score distribution. A. CNVRs are defined by the outermost upstream and downstream breakpoints for a set of CNVs of the same type. The within-technology support is defined as “single” when a CNVR is derived from a single CNV call of one of the datasets and otherwise as “multi”; B. Distribution of long-read CNVRs according to their length, the long-read score bins (in grey, green and beige) and within-technology support (x-axis); C. For each CNV call, a read depth fold change (DFC) score cutoff is used to define high-quality (HQ) deletions (top) and duplications (bottom) with respect to support for a given CNV in a chosen short-read alignment. CNV calls that do not meet the defined DFC score thresholds are defined as low quality (LQ); D. Density plot (counts) for deletions (indicated with negative values on the x-axis) and duplications (positive values on the x-axis), across CNV call sizes (x-axis ticks) and DFC score-based quality bins (left and right panels), with arrays shown in red, short reads in blue and long reads in grey, green and beige (for the long-read score binning as in B). The “long-read score <1” (lr. <1, in grey) category is omitted in the right panel for scaling purposes. Technologies: array - array, lr - long reads, sr - short reads
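
The CNVR definition in panel A can be illustrated with a minimal Python sketch. The caption states only that a CNVR spans the outermost breakpoints of a set of same-type CNVs; how calls are grouped into a set is not stated, so the sketch assumes that calls on the same chromosome whose closed intervals overlap are merged. The `CNVCall` class, its fields and `build_cnvrs` are hypothetical names introduced for illustration, not the authors' code.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class CNVCall:
    chrom: str
    start: int
    end: int
    cnv_type: str  # e.g. "DEL" or "DUP"
    dataset: str   # hypothetical label of the originating call set


def build_cnvrs(calls):
    """Merge overlapping same-type calls into CNVRs (assumed grouping rule)."""
    ordered = sorted(calls, key=lambda c: (c.chrom, c.cnv_type, c.start))
    cnvrs = []
    for (chrom, cnv_type), group in groupby(
        ordered, key=lambda c: (c.chrom, c.cnv_type)
    ):
        current = None
        for call in group:
            if current is not None and call.start <= current["end"]:
                # Extend the region to the outermost downstream breakpoint.
                current["end"] = max(current["end"], call.end)
                current["n_calls"] += 1
            else:
                if current is not None:
                    cnvrs.append(current)
                # Calls are sorted by start, so this is the outermost
                # upstream breakpoint of the new region.
                current = {
                    "chrom": chrom,
                    "type": cnv_type,
                    "start": call.start,
                    "end": call.end,
                    "n_calls": 1,
                }
        if current is not None:
            cnvrs.append(current)
    # "single" if the region comes from exactly one call, otherwise "multi".
    for region in cnvrs:
        region["support"] = "single" if region["n_calls"] == 1 else "multi"
    return cnvrs
```

Calling `build_cnvrs` on the calls of one technology returns one dictionary per CNVR, each carrying its boundaries and a `support` label that follows the "single"/"multi" wording of the caption.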

**Fig. 2**
Definition of CNV loci, their composition and percentage span by CNVRs. A. CNV loci are defined by the outermost breakpoints for a set of CNVRs of the same type. Depending on the CNVRs included for each technology, the resulting CNV locus boundaries will vary; left to right: CNVRs from all technologies included; long-read CNVRs filtered by intrinsic score >5 (the boundary has changed because one long-read CNVR is no longer included due to its low score, indicated by a hollow rectangle); long-read CNVRs filtered by intrinsic score >1 and only HQ CNVRs included for all three technologies (the boundary has changed again since an LQ array CNVR is now also excluded, hollow top rectangle); B. Histogram of CNVR counts (using the CNV loci for the long-read score >1 set as representative) binned by size (x-axis) and by the list of supporting technologies for array, long-read and short-read, respectively; C. The between-technology support is defined as the number of technologies having a CNVR in the given CNV locus (CNVL): one, two or three; D. To compare the sizes of the constituent CNVRs for each CNV locus, the percentage span of the CNV locus is calculated for each technology's CNVR, i.e., (CNVR length / CNV locus length) × 100; E. Same CNVRs as in panel B, but visualized as proportions rather than counts. Color legend shared with panel B
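
The two quantities defined in panels C and D reduce to simple arithmetic, sketched here in Python. Function and parameter names are hypothetical, and lengths are computed as `end - start`, which assumes half-open coordinates; the caption does not specify the coordinate convention.

```python
def percentage_span(cnvr_start, cnvr_end, locus_start, locus_end):
    """Share of a CNV locus covered by one technology's CNVR, in percent."""
    locus_length = locus_end - locus_start
    if locus_length <= 0:
        raise ValueError("CNV locus must have positive length")
    return (cnvr_end - cnvr_start) / locus_length * 100


def between_technology_support(technologies_with_cnvr):
    """Number of distinct technologies having a CNVR in a CNV locus."""
    return len(set(technologies_with_cnvr))
```

For instance, a 50 bp CNVR inside a 200 bp CNV locus spans 25% of it, and a locus containing array and long-read CNVRs has a between-technology support of two.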

**Fig. 3**
Cross-technology support relation to quality cues. A. Cross-technology data evidence collection. For each CNVR in the array, long- and short-read sets, the raw data in all three technologies is assayed for evidence of support. Array signal in the probes within the CNVR is compared to the flanking regions, resulting in a distance metric; read depth fold change score is used as evidence for the short-read data; long-read data is assayed as shown on the dot plots, for evidence of genotypes. B. Normalized density plot with the array-based support score on the x-axis (the score represents the distance between the LRR distribution of the probes in the CNV versus the flanking regions; the larger the distance, the more support for a CNV), split by DFC score bins (High and Low Quality) and colored by technology; C. Violin plots showing the distribution of the array-based support score for CNVRs grouped by support derived from assaying the long-read data, with the “concordant” group denoting CNVRs for which the long reads indicated concordance with the presence of the variant, while for the “discordant” group there was no such support from the long-read data

**Fig. 4**
CNV presence and frequencies in public databases. Within and between technology support. A. Percentage of CNVRs present (at 50% overlap) in public databases; B. Frequencies of CNVRs in public databases (at 50% overlap). DGV - Database of Genomic Variants, DDD - Deciphering Developmental Disorders database, GD - gnomAD and IMH - Ira M. Hall lab database; C. Relation of between-technology (x-axis) and within-technology (color fill) support
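
The "50% overlap" criterion in panels A and B can be sketched as follows. The caption does not say whether the overlap is measured relative to the CNVR only or reciprocally, so this Python sketch assumes the fraction is taken over the CNVR and offers a reciprocal variant as an option; all names are hypothetical, and intervals are treated as half-open `(start, end)` pairs on the same chromosome.

```python
def overlap_fraction(a, b):
    """Fraction of interval a covered by interval b."""
    a_start, a_end = a
    b_start, b_end = b
    shared = max(0, min(a_end, b_end) - max(a_start, b_start))
    length = a_end - a_start
    return shared / length if length > 0 else 0.0


def is_present(cnvr, db_entries, min_overlap=0.5, reciprocal=False):
    """True if any database entry overlaps the CNVR by at least min_overlap."""
    for entry in db_entries:
        ok = overlap_fraction(cnvr, entry) >= min_overlap
        if reciprocal:
            # Additionally require the CNVR to cover enough of the entry.
            ok = ok and overlap_fraction(entry, cnvr) >= min_overlap
        if ok:
            return True
    return False
```

With the one-directional rule a CNVR at (100, 200) counts as present given a database entry at (150, 400), because half of the CNVR is covered; under the reciprocal rule it does not, because the CNVR covers only a fifth of the entry.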
