Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Mar 18;26(Suppl 2):263.
doi: 10.1186/s12864-025-11398-z.

Large indel detection in region-based phased diploid assemblies from linked-reads

Affiliations

Large indel detection in region-based phased diploid assemblies from linked-reads

Can Luo et al. BMC Genomics. .

Abstract

Background: Linked-reads improve de novo assembly, haplotype phasing, structural variant (SV) detection, and other applications through highly-multiplexed genome partitioning and barcoding. Whole genome assembly and assembly-based variant detection based on linked-reads often require intensive computation costs and are not suitable for large population studies. Here we propose an efficient pipeline, RegionIndel, a region-based diploid assembly approach to characterize large indel SVs. This pipeline only focuses on target regions (50kb by default) to extract barcoded reads as input and then integrates a haplotyping algorithm and local assembly to generate phased diploid contiguous sequences (contigs). Finally, it detects variants in the contigs through a pairwise contig-to-reference comparison.

Results: We applied RegionIndel on two linked-reads libraries of sample HG002, one using 10x and the other stLFR. HG002 is a well-studied sample and the Genome in a Bottle (GiaB) community provides a gold standard SV set for it. RegionIndel outperformed several assembly and alignment-based SV callers in our benchmark experiments. After assembling all indel SVs, RegionIndel achieved an overall F1 score of 74.8% in deletions and 61.8% in insertions for 10x linked-reads, and 64.3% in deletions and 36.7% in insertions for stLFR linked-reads, respectively. Furthermore, it achieved an overall genotyping accuracy of 83.6% and 80.8% for 10x and stLFR linked-reads, respectively.

Conclusions: RegionIndel can achieve diploid assembly and detect indel SVs in each target region. The phased diploid contigs can further allow us to investigate indel SVs with nearby linked single nucleotide polymorphism (SNPs) and small indels in the same haplotype.

Keywords: Diploid assembly; Linked-reads; Phasing; Region-based; Structural variants.

PubMed Disclaimer

Conflict of interest statement

Declarations. Code availability and requirements: Project name: RegionIndel Project home page: https://github.com/maiziezhoulab/RegionIndel Operating system(s): Linux Programming language: Python Other requirements: Python 3 or higher. License: The MIT License. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare that they have no competing interests.

Figures

Fig. 1
Fig. 1
Schematic diagram of the RegionIndel pipeline. Input data are the high-quality reference genome and barcoded reads, each with a barcode (not shown). The reads extraction module extracts barcoded reads aligning to the region of interest. The reads with the same barcode (read cloud) form the virtual long fragment molecule. The haplotyping module partitions molecules into different parental haplotypes, and then the corresponding barcoded reads will be partitioned into different haplotypes. The local assembly module takes barcoded reads and performs de novo local assembly independently for each haplotype through SPAdes. Finally, the variant calling module integrates paftools to perform a contig-to-reference comparison to detect all variants
Fig. 2
Fig. 2
The distribution plot of the number of SVs as a function of the percentage of phased molecules or SV length. A The distribution plot as a function of the percentage of phased molecules in stLFR linked-reads for true positive (TP) and false negative (FN) deletions (top panels) and TP and FN insertions (bottom panels). B The distribution plot as a function of the percentage of phased molecules in 10x linked-reads for TP and FN deletions (top panels) and TP and FN insertions (bottom panels). Each vertical line represents the average percentage of phased molecules. C The distribution plot as a function of SV length for TP deletions (<20% phased molecules) D The distribution plot as a function of SV length for FN insertions (>80% phased molecules)
Fig. 3
Fig. 3
One example of 2.07kb heterozygous deletion from the stLFR linked-reads data by RegionIndel. A The IGV displays assembled diploid contigs from AquiaSV. It includes tracks for the contig BAM file from RegionIndel and the GiaB benchmark VCF file. The contigs also show that there are several variants in the nearby region. B It is a zoom-in view of the target SV in the center. The left inset box has the description of this benchmark SV, and the right inset box has the description of the contig information generated by RegionIndel. It indicates that this deletion is identified from the contig of haplotype1 and the length of this contig is 48.069kb. The gold standard supports this heterozygous deletion
Fig. 4
Fig. 4
One example of 70bp homozygous insertion from the 10x linked-reads data by RegionIndel. A The IGV displays assembled diploid contigs from AquiaSV. It includes tracks for the contig BAM file from RegionIndel and the GiaB benchmark VCF file. The contigs also show there are several variants in the nearby region. B It is a zoom-in view of the target SV in the center. The left inset box has the description of this benchmark SV, and the right inset box has the description of the contig information generated by RegionIndel. The gold standard supports this homozygous insertion. The contigs indicate this 70bp insertion is identified by contigs from both haplotypes

Similar articles

References

    1. Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011;12:363–73. - PMC - PubMed
    1. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. - PMC - PubMed
    1. Chaisson MJ, Huddleston J, et al. Resolving the complexity of the human genome using single- molecule sequencing. Nature. 2015;517:608–11. - PMC - PubMed
    1. Chaisson MJ, Sanders AD, Zhao X, et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun. 2019;10:1784. - PMC - PubMed
    1. Huddleston J, Eichler EE. An incomplete understanding of human ge- netic variation. Genetics. 2016;202:1251–4. - PMC - PubMed

LinkOut - more resources