BMC Bioinformatics. 2018 Jun 20;19(1):234.
doi: 10.1186/s12859-018-2243-x.

ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers

Lauren Coombe et al. BMC Bioinformatics. 2018.

Abstract

Background: The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time.

Results: Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~ 10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13).
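The contiguity figures above are NG50 values. As a minimal sketch of how that statistic is commonly computed (the function name and toy lengths are illustrative, not taken from the paper or from QUAST), NG50 is the length of the scaffold at which the running total of scaffold lengths, taken from longest to shortest, first reaches half of the genome size:

```python
def ng50(scaffold_lengths, genome_size):
    """Return NG50: the length L such that scaffolds of length >= L
    together cover at least half of genome_size.

    Returns 0 when the assembly covers less than half of the genome.
    """
    running_total = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running_total += length
        # Integer comparison avoids floating-point issues with genome_size / 2.
        if 2 * running_total >= genome_size:
            return length
    return 0


# Toy example: four scaffolds measured against a genome of size 100.
print(ng50([40, 30, 20, 10], 100))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size, so assemblies of different total length can be compared directly.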

Conclusions: ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performance on experimental human genome datasets. Furthermore, the novel distance estimator in ARKS uses barcoding information from linked reads to estimate gap sizes. It does so by calculating Jaccard indices for regions within contigs that lie at known distances from one another, and modeling the relationship between those distances and the indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.
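To make the idea behind the distance estimator concrete, the sketch below computes the Jaccard index of two barcode sets and then looks up a gap size from within-contig samples. The nearest-neighbour lookup is an assumption chosen for brevity; the abstract does not specify the model ARKS fits, and the function names and sample values are hypothetical.

```python
def jaccard(barcodes_a, barcodes_b):
    """Jaccard index of two barcode collections: |A & B| / |A | B|."""
    a, b = set(barcodes_a), set(barcodes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimate_gap(query_jaccard, samples):
    """Pick the known distance whose Jaccard index is closest to the query.

    samples: list of (known_distance_bp, jaccard_index) pairs measured
    between regions inside contigs, where the true distance is known.
    This nearest-neighbour rule is a stand-in, not the ARKS model.
    """
    distance, _ = min(samples, key=lambda s: abs(s[1] - query_jaccard))
    return distance


# Two regions sharing 2 of 4 distinct barcodes.
print(jaccard({"A", "B", "C"}, {"B", "C", "D"}))

# Hypothetical training samples: more shared barcodes at shorter distances.
samples = [(1000, 0.6), (5000, 0.3), (20000, 0.1)]
print(estimate_gap(0.28, samples))
```

The intuition is that two regions close together on the same long DNA molecule share more barcodes than regions far apart, so the index shrinks as distance grows.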

Keywords: 10× Genomics Chromium; ARCS; ARKS; Genome scaffolding; Kmers; Linked reads; Next-generation sequencing; Read mapping; Supernova assembler; de novo assembly.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
ARKS algorithm. (a) In the first step, the barcoded Chromium reads are mapped to the ends of the draft assembly sequences (indicated as contigs in the figure) using a kmer-based approach. Reads from three distinct Chromium barcodes are depicted in red, green, and blue, with connecting dashed lines indicating the underlying long DNA molecules for each barcode. The gray regions of the target contigs indicate interior sequence that is masked during mapping. The barcode/contig associations derived from the read mappings are stored in a hash table that maps barcodes to contig ends. (b) In the second step, we iterate over the barcode-to-contig map and tally the number of barcodes that are shared by each candidate pair of contig ends. (c) In the third and final step, we generate the output scaffold graph by creating an edge for each candidate pair of contig ends that has greater than a threshold number of shared barcodes (0 by default).
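The three steps in this caption can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the ARKS implementation: the function names, the head/tail labels "H" and "T", and the tiny values of `k` and `end_len` are hypothetical, reverse complements are ignored, and a read is credited to every contig end it shares a kmer with rather than being assigned by a similarity threshold.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def index_contig_ends(contigs, k, end_len):
    """Map each kmer in a contig end to (contig, 'H' or 'T').

    Interior sequence is never indexed (the masked gray regions).
    Kmers seen in more than one contig end are dropped as ambiguous.
    """
    index, ambiguous = {}, set()
    for name, seq in contigs.items():
        for end, region in (("H", seq[:end_len]), ("T", seq[-end_len:])):
            for km in kmers(region, k):
                if km in index and index[km] != (name, end):
                    ambiguous.add(km)
                index[km] = (name, end)
    return {km: hit for km, hit in index.items() if km not in ambiguous}


def scaffold_edges(contigs, reads, k=4, end_len=6, min_shared=0):
    """reads: iterable of (barcode, sequence). Returns {(end_a, end_b): count}."""
    index = index_contig_ends(contigs, k, end_len)

    # Step a: hash table mapping each barcode to the contig ends its reads hit.
    barcode_to_ends = defaultdict(set)
    for barcode, read in reads:
        for km in kmers(read, k):
            if km in index:
                barcode_to_ends[barcode].add(index[km])

    # Step b: tally barcodes shared by each pair of ends on different contigs.
    pair_counts = defaultdict(int)
    for ends in barcode_to_ends.values():
        for a, b in combinations(sorted(ends), 2):
            if a[0] != b[0]:
                pair_counts[(a, b)] += 1

    # Step c: keep an edge when the count exceeds the threshold.
    return {pair: n for pair, n in pair_counts.items() if n > min_shared}


contigs = {"c1": "AAAACCCCGGGG", "c2": "TTTTACGTACGATCGA"}
reads = [("bx1", "CCGGGG"), ("bx1", "TTTTAC"),
         ("bx2", "CGGGG"), ("bx2", "TTTAC"),
         ("bx3", "AAAACC")]
print(scaffold_edges(contigs, reads))
```

In this toy input, barcodes `bx1` and `bx2` each touch the tail of `c1` and the head of `c2`, so that pair of ends receives an edge supported by two barcodes, while `bx3` touches a single end and contributes nothing.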
Fig. 2
Scaffolding a 10xG Supernova human genome assembly with ARKS. (a) Comparison of the contiguity and accuracy of assemblies scaffolded by ARKS, ARCS, fragScaff and Architect as measured by QUAST. The baseline NA12878 Supernova assembly was scaffolded using ARCS (-c5 -s98 -m50-6000 -z3000 -e30000), ARKS (-c5 -k100 -t8 -j0.5 -m50-6000 -z3000 -e30000), fragScaff (-E30000 -j1 -u2) and Architect (--rc-abs-thr5 --rc-rel-edge-thr0.2, --rc-rel-prun-thr abbreviated to ‘P’). The Y-axes show the range of NGA50 to NG50 lengths to indicate the uncertainty caused by real genomic variations between individual NA24143 and the reference genome GRCh38. (b) A Circos [24] assembly consistency plot of ARKS (-k100 -j0.5 -c5 -e30000 -z3000 -m50-6000 -r0.05 -a0.5) scaffolding of the baseline NA12878 Supernova assembly. Scaftigs from the largest 123 scaffolds, consisting of 85% (NG85) of the genome, are aligned to GRCh38 with BWA mem [18]. GRCh38 chromosomes are displayed incrementally from 1 (bottom, red) to X (top, fuchsia) on the left while scaffolds (grey with black outlines) are displayed on the right side of the rim. Connections show aligned regions, 1 Mbp and larger, between the genome and scaffolds. Large-scale misassemblies are visible as cross-over ribbons. The black regions on chromosomes indicate reconstruction gaps in the reference. The majority of each chromosome is represented in the final ARKS assembly by no more than 13 assembly scaffolds.
Fig. 3
Chromosome-scale ARKS scaffolding of a NA12878 10xG Supernova assembly. Scaftigs from the NG85 scaffolds of the baseline NA12878 Supernova assembly (blue) and the assembly following ARKS scaffolding (-k100 -j0.5 -c5 -e30000 -z3000 -m50-6000 -r0.05 -a0.5) (green) were aligned to GRCh38 using BWA mem [18]. The ideogram was generated using the R package chromPlot [25]. For each assembly, alternating shades represent different scaffolds. Red bands on the reference chromosomes denote gaps in the reference assembly.
Fig. 4
Benchmarking wall clock time for ARCS and ARKS. The wall clock benchmarking is shown for the most contiguous assemblies from scaffolding the NA12878 Supernova assembly with ARCS (-a0.5 -s98 -c5 -m50-6000 -e30000 -z3000) and ARKS (-a0.5 -k100 -t8 -c5 -j0.5 -m50-6000 -e30000 -z3000). For ARCS, the linked reads were partitioned into eight sets, and aligned to the draft assembly in parallel. The wall clock time for the bottleneck read alignment is shown. ARKS was run with eight threads, while only the alignment step of ARCS was run with eight threads, as threading is not implemented in ARCS.
Fig. 5
Contiguity gains from scaffolding a NA24143 PacBio Falcon + HiRise human genome assembly with ARKS. The base Falcon (orange bar) and Falcon + HiRise (HR) (green) draft genomes were corrected with Tigmint (tg) [19] (w = 2000, n = 2), then scaffolded using ARKS (-a0.3 -c5 -k40 -j0.5 -e30000 -z3000 -t8 -m50-1000) (pink and blue bars). We also ran ARKS on the base Falcon (yellow) and Falcon + HiRise (turquoise) draft genomes without Tigmint correction.

References

    1. Zheng GX, Lau BT, Schnall-Levin M, Jarosz M, Bell JM, Hindson CM, et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol. 2016;34(3):303–311. doi: 10.1038/nbt.3432.
    2. Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB. Direct determination of diploid genome sequences. Genome Res. 2017;27(5):757–767. doi: 10.1101/gr.214874.116.
    3. Yeo S, Coombe L, Chu J, Warren RL, Birol I. ARCS: scaffolding genome drafts with linked reads. Bioinformatics. 2018;34(5):725–731. doi: 10.1093/bioinformatics/btx675.
    4. Adey A, Kitzman JO, Burton JN, Daza R, Kumar A, Christiansen L, et al. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Res. 2014;24(12):2041–2049. doi: 10.1101/gr.178319.114.
    5. Kuleshov V, Snyder MP, Batzoglou S. Genome assembly from synthetic long read clouds. Bioinformatics. 2016;32(12):i216–i224. doi: 10.1093/bioinformatics/btw267.