Genome Biol. 2022 Nov 9;23(1):237.
doi: 10.1186/s13059-022-02803-x.

Personalized genome assembly for accurate cancer somatic mutation discovery using tumor-normal paired reference samples

Chunlin Xiao et al. Genome Biol.

Abstract

Background: The use of a personalized haplotype-specific genome assembly, rather than an unrelated, mosaic genome like GRCh38, as a reference for detecting the full spectrum of somatic events from cancers has long been advocated but has never been explored in tumor-normal paired samples. Here, we provide the first demonstrated use of a de novo assembled personalized genome as a reference for cancer mutation detection and quantify the effects of the reference genome on the accuracy of somatic mutation detection.

Results: We generate de novo assemblies of the first tumor-normal paired genomes, both nuclear and mitochondrial, derived from the same individual with triple negative breast cancer. The personalized genome is chromosome-scale, haplotype-phased, and annotated. We demonstrate that it provides individual-specific haplotypes for complex regions and medically relevant genes. We illustrate that the personalized genome reference not only improves read alignments for both short-read and long-read sequencing data but also improves the detection accuracy of somatic SNVs and SVs. We identify the equivalent somatic mutation calls between the two genome references and uncover novel somatic mutations that are detected only when the personalized genome assembly is used as a reference.

Conclusions: Our findings demonstrate that use of a personalized genome with individual-specific haplotypes is essential for accurate detection of the full spectrum of somatic mutations in the paired tumor-normal samples. The unique resource and methodology established in this study will be beneficial to the development of precision oncology medicine not only for breast cancer, but also for other cancers.

Conflict of interest statement

CP was employed by Dovetail Genomics, LLC, and LF was employed by Roche Sequencing Solutions Inc during the course of this research. The other authors declare that they have no competing interests.

Figures

Fig. 1
Schematic diagram of study design. Sequencing data from five different platforms were used for initial de novo assembly, assembly evaluation, scaffolding, and phasing for the normal reference sample (HCC1395BL B Lymphocyte cell line), while sequencing data from three platforms were used for de novo assembly and assembly evaluation for the tumor reference sample (HCC1395 breast cancer cell line from the same donor). The final assembled personal genome, known as HCC1395BL_v1.0, was used as reference for read mapping with both short and long reads, and assessment of somatic SNVs and SVs as compared to that using GRCh38 as reference
Fig. 2
A Circos consistency plot of HCC1395BL_v1.0 (right side) against the GRCh38 reference (left side). Included are the 71 largest scaffolds of at least 2 Mbp, which account for 2,775,074,314 bp (95.51%). Shown here are alignments with coverage of at least 100 kb and mapping quality of at least 60 on GRCh38 using minimap2. Centromeres are marked with circles on the inner circle of GRCh38 chromosomes. Black regions on the chromosomes represent GRCh38 gaps of 100 kb or greater in size. Five chromosomes are almost completely covered by single scaffolds (Scaffold_1 for chr4, Scaffold_3 for chr8, Scaffold_8 for chr14, Scaffold_13 for chr18, and Scaffold_18 for chr20; colored red). Four chromosomes (chr2, chr3, chr12, and chr19) are broken only at centromere regions (covered almost completely by just two scaffolds). Centromere-crossing scaffolds are colored light blue. Scaffolds (Scaffold_5, Scaffold_10, and Scaffold_17) covering one arm are colored dark blue. Scaffolds with near full coverage of one arm are colored yellow
Fig. 3
A Summary of RefSeq genes/transcripts mapping on HCC1395BL_v1.0 and GRCh38 with cutoffs 95% identity + 50% coverage versus 95% identity + 95% coverage. The bottom table provides a summary of the mapping using 95% identity + 50% coverage as cutoff. B HLA coding genes on Scaffold_30 of HCC1395BL_v1.0 in comparison to those on chromosome 6 of GRCh38 primary assembly. The haplotype of HLA-DRB (labels in red) in HCC1395BL_v1.0 consists of the HLA-DRB1 and HLA-DRB4 genes (human HLA-DR53 haplotype group), while the GRCh38 primary assembly contains HLA-DRB1 and HLA-DRB5 genes (human HLA-DR51 haplotype group). The human HLA-DR53 haplotype is represented only in GRCh38 ALT_REF_LOCI sequences. Although scaffold_30 maps onto GRCh38 entirely in reverse complement, the HLA gene order is preserved between GRCh38 and HCC1395BL_v1.0
Fig. 4
Improvements of Illumina short-read and PacBio long-read mappings with personalized genome HCC1395BL_v1.0 reference as compared to GRCh38. A Percentages of properly paired reads in alignments for tumor and normal samples. B Reductions of non-properly paired reads in alignments for tumor and normal samples with personalized genome HCC1395BL_v1.0 as compared to GRCh38. C Reductions of mismatches in alignments for tumor and normal samples with personalized genome HCC1395BL_v1.0 as compared to GRCh38. D Reductions of split reads in alignments for tumor and normal samples with personalized genome HCC1395BL_v1.0 as compared to GRCh38. E Standard deviations of read coverages in short-read alignments with personalized genome HCC1395BL_v1.0 as compared to GRCh38. F Standard deviations of read coverages in PacBio long-read alignments with personalized genome HCC1395BL_v1.0 as compared to GRCh38
Fig. 5
Somatic SNV detection using short reads on GRCh38 and HCC1395BL_v1.0 references. A Higher numbers of overlapping SNVs between MuTect2 and Strelka2 detected from 12 paired tumor-normal replicates on HCC1395BL_v1.0 as reference as opposed to GRCh38. B Higher number of overlapping INDELs between MuTect2 and Strelka2 detected from 12 paired tumor-normal replicates on HCC1395BL_v1.0 as reference as opposed to GRCh38. C 40,768 (97.83%) of 41,669 GRCh38-based somatic SNVs were considered mapped with HCC1395BL_v1.0-based SNVs, including 36,773 (88.25%) identical SNVs and 3995 (9.59%) equivalent SNVs. In total, 682 SNVs (1.64%) were able to map onto HCC1395BL_v1.0 but without overlapping Strelka2/MuTect2 calls. A total of 219 SNVs (0.53%) were considered as “not-mapped” onto HCC1395BL_v1.0 due to the stringent mapping criteria. D KEGG pathway enrichment analysis of 71 genes overlapped with the 1017 novel SNVs detected with HCC1395BL_v1.0 as a reference. Shown here are the top 10 enriched pathways with bar representing “odds ratio” (zScore) on the left side y-axis, dotted-line representing −log (p-value) on right side y-axis, and the numeric label showing counts of enriched gene versus the total genes in each pathway
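The SNV mapping breakdown in panel C of Fig. 5 can be tallied with a few lines of Python. This is only a bookkeeping sketch of the counts quoted in the caption, not the authors' mapping pipeline; the dictionary keys and the `percent` helper are names made up for this illustration.

```python
# Counts copied from the Fig. 5C caption. The category names are
# hypothetical labels chosen for this sketch, not terms from the paper's code.
total_grch38_snvs = 41_669
snv_categories = {
    "identical": 36_773,
    "equivalent": 3_995,
    "mapped_no_overlapping_call": 682,
    "not_mapped": 219,
}


def percent(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# SNVs considered mapped with HCC1395BL_v1.0-based SNVs: identical + equivalent.
mapped_with_call = snv_categories["identical"] + snv_categories["equivalent"]

# The four categories should partition the full GRCh38-based call set.
assert sum(snv_categories.values()) == total_grch38_snvs
assert mapped_with_call == 40_768

for name, count in snv_categories.items():
    print(f"{name:28s} {count:6d}  {percent(count, total_grch38_snvs):6.2f}%")
```

The recomputed percentages may differ from the caption in the last decimal place, depending on whether values are rounded or truncated.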
Fig. 6
Summary of somatic SV detection using tumor-normal paired short-read WGS data with the HCC1395BL_v1.0 reference as compared to GRCh38. A Somatic SV counts discovered by GRIDSS2, Manta, Delly, and novoBreak with the HCC1395BL_v1.0 reference as compared to GRCh38. B Somatic SV counts discovered by two or more somatic SV callers with the HCC1395BL_v1.0 reference as compared to GRCh38. C 617 of 646 GRCh38-based somatic SVs (TRA excluded) were mapped to HCC1395BL_v1.0-based SVs, 18 of 646 SVs were “unmapped,” and 11 of 646 SVs were mapped onto HCC1395BL_v1.0 but had no SVs at the mapped locations. D Somatic SVs supported by two or more callers that had no mapped GRCh38-based SVs on HCC1395BL_v1.0. E KEGG pathway enrichment analysis of 17 genes overlapped with the 17 somatic SVs detected with HCC1395BL_v1.0 as a reference. Shown here are the top 7 enriched pathways with bar representing “odds ratio” (zScore) on the left side y-axis, dotted-line representing −log (p-value) on right side y-axis, and the numeric label showing counts of enriched gene versus the total genes in each pathway
Fig. 7
Summary of somatic SVs detected in two or more replicates by four short-read callers on HCC1395BL_v1.0 as compared to GRCh38. A Counts of somatic SVs detected in two or more replicates by GRIDSS2, Manta, Delly, and novoBreak on the HCC1395BL_v1.0 reference as opposed to GRCh38. B Counts of GRCh38-based somatic SVs with support from two or more replicates that were mapped to HCC1395BL_v1.0-based SVs. C Counts of GRCh38-based somatic SVs with support from two or more replicates that were unmapped or had no matching SVs on the HCC1395BL_v1.0 reference (GRCh38-specific SVs). D Counts of HCC1395BL_v1.0-based somatic SVs with support from two or more replicates that had no mapped GRCh38-based SVs in corresponding locations on the HCC1395BL_v1.0 reference (personalized genome-specific SVs)
Fig. 8
Summary of somatic SVs detected in PacBio long-read sequencing data and assembled contigs. A Counts of somatic SVs with support from two or more calling methods in the tumor sample (HCC1395) using PacBio long reads and assembled contigs on HCC1395BL_v1.0 as compared to the GRCh38 reference. B Of the 744 GRCh38-based somatic SVs supported by three or more calling methods, 531 SVs were mapped with matched SVs on HCC1395BL_v1.0, 129 SVs were mapped but without matching SVs, and 84 SVs were considered “unmapped” on HCC1395BL_v1.0. C KEGG pathway enrichment analysis for 86 genes overlapped with 91 novel SVs. Shown here are the top 10 enriched pathways with “odds ratio” (zScore) and log (p-value) on the y-axis. The numeric labels are the enriched gene counts versus the total genes in each pathway. D Repeat annotation, using RepeatMasker, of the sequences of the 72 deletions among the 91 SVs that overlapped 86 gene regions showed that 32 deletions overlapped 10 classes of repeat families, among them 17 SINE/Alu, 8 simple repeats, and 3 retroposon/SVA
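The SV mapping counts quoted in the captions of Fig. 6 (panel C, short reads) and Fig. 8 (panel B, long reads and contigs) each split a GRCh38-based call set into three outcomes. The sketch below, which assumes nothing beyond the numbers in those captions, checks that each split adds up and converts it to fractions; `check_partition` and the category labels are hypothetical names used only for this illustration.

```python
# Counts copied from the Fig. 6C and Fig. 8B captions. Keys are illustrative labels.
sv_partitions = {
    "Fig. 6C (short reads, TRA excluded)": (
        646,
        {"mapped_with_matching_sv": 617,
         "mapped_without_matching_sv": 11,
         "unmapped": 18},
    ),
    "Fig. 8B (long reads and contigs)": (
        744,
        {"mapped_with_matching_sv": 531,
         "mapped_without_matching_sv": 129,
         "unmapped": 84},
    ),
}


def check_partition(total, parts):
    """Confirm the parts sum to total, then return each part as a fraction of total."""
    if sum(parts.values()) != total:
        raise ValueError(f"parts sum to {sum(parts.values())}, expected {total}")
    return {name: count / total for name, count in parts.items()}


fractions = {
    label: check_partition(total, parts)
    for label, (total, parts) in sv_partitions.items()
}

for label, parts in fractions.items():
    print(label)
    for name, frac in parts.items():
        print(f"  {name:28s} {100 * frac:6.2f}%")
```

Raising an error on a mismatched total, rather than silently normalizing, makes a transcription slip in the counts visible immediately.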
