ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers

Lauren Coombe¹, Jessica Zhang¹, Benjamin P Vandervalk¹, Justin Chu¹, Shaun D Jackman¹, Inanc Birol¹, René L Warren²

Affiliations

¹ BC Cancer Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada.
² BC Cancer Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada. rwarren@bcgsc.ca.

PMID: 29925315
PMCID: PMC6011487
DOI: 10.1186/s12859-018-2243-x

Lauren Coombe et al. BMC Bioinformatics. 2018 Jun 20;19(1):234. doi: 10.1186/s12859-018-2243-x.

Abstract

Background: The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time.

Results: Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~ 10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13).
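The contiguity figures above are NG50 values. As a reminder of how that statistic is computed, here is a minimal Python sketch; the function name `ng50` and the toy lengths are illustrative assumptions, not taken from the paper. NG50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the genome size (N50 uses the assembly size instead).

```python
def ng50(lengths, genome_size):
    """Return the NG50 of a set of scaffold lengths.

    Scaffolds are visited from longest to shortest; the NG50 is the length
    of the scaffold at which the running total first reaches half of
    genome_size. Returns 0 if the assembly never covers half the genome.
    """
    target = genome_size / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


# Toy example (hypothetical lengths): 50 + 40 + 30 = 120 >= 100
print(ng50([50, 40, 30, 20, 10], genome_size=200))  # 30
```

With this definition, the five-fold improvement quoted above follows directly from the two NG50 values: 23.1 / 4.6 ≈ 5.0.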

Conclusions: ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performance on experimental human genome datasets. Furthermore, the novel distance estimator in ARKS utilizes barcoding information from linked reads to estimate gap sizes. It accomplishes this by modeling the relationship between known distances separating regions within contigs and the Jaccard indices calculated for those regions. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.
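To make the idea behind the distance estimator concrete, the following Python sketch computes a Jaccard index between the barcode sets of two regions and looks up a distance from calibration samples measured within contigs. The abstract does not specify how the relationship is modeled, so the nearest-neighbour lookup, the function names `jaccard` and `estimate_distance`, the parameter `n_neighbours`, and all numbers are assumptions made purely for illustration.

```python
import statistics


def jaccard(barcodes_a, barcodes_b):
    """Jaccard index of two barcode collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(barcodes_a), set(barcodes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimate_distance(observed_jaccard, calibration, n_neighbours=3):
    """Estimate a gap size from an observed Jaccard index.

    calibration is a list of (known_distance, jaccard) pairs measured
    between regions within contigs. This sketch (an assumption, not the
    published model) returns the median distance of the n_neighbours
    calibration samples whose Jaccard index is closest to the observation.
    """
    if not calibration:
        raise ValueError("calibration samples are required")
    nearest = sorted(calibration, key=lambda s: abs(s[1] - observed_jaccard))
    return statistics.median([dist for dist, _ in nearest[:n_neighbours]])


# Hypothetical calibration: barcode sharing drops as regions get farther apart.
calibration = [(1000, 0.8), (5000, 0.5), (10000, 0.3), (20000, 0.1)]
shared = jaccard({"BX1", "BX2", "BX3"}, {"BX2", "BX3", "BX4"})  # 2 / 4 = 0.5
print(shared, estimate_distance(shared, calibration, n_neighbours=1))
```

The design intuition is that two regions drawn from the same long molecules share more barcodes when they are close together, so a lower Jaccard index between two contig ends suggests a larger gap.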

Keywords: 10× Genomics Chromium; ARCS; ARKS; Genome scaffolding; Kmers; Linked reads; Next-generation sequencing; Read mapping; Supernova assembler; de novo assembly.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
ARKS algorithm. (a) In the first step, the barcoded Chromium reads are mapped to the ends of the draft assembly sequences (indicated as contigs in the figure) using a *kmer*-based approach. Reads from three distinct Chromium barcodes are depicted in red, green, and blue, with connecting dashed lines indicating the underlying long DNA molecules for each barcode. The gray regions of the target contigs indicate interior sequence that is masked during mapping. The barcode/contig associations derived from the read mappings are stored in a hash table that maps barcodes to contig ends. (b) In the second step, we iterate over the barcode-to-contig map and tally the number of barcodes that are shared by each candidate pair of contig ends. (c) In the third and final step, we generate the output scaffold graph by creating an edge for each candidate pair of contig ends that has greater than a threshold number of shared barcodes (0 by default).
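The three steps in the Fig. 1 caption can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ARKS implementation: all function names and the `min_fraction` parameter are hypothetical, reverse complements are ignored, kmers found in more than one contig end are simply discarded, and each read is assigned to the single contig end that matches most of its kmers.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of kmers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def index_contig_ends(contigs, k, end_length):
    """Map each kmer to the one contig end it occurs in.

    Only the first and last end_length bases are indexed (the interior is
    masked). Kmers seen in more than one end are dropped (an assumption).
    """
    index, ambiguous = {}, set()
    for name, seq in contigs.items():
        for label, region in (("head", seq[:end_length]), ("tail", seq[-end_length:])):
            for kmer in kmers(region, k):
                if kmer in ambiguous:
                    continue
                if kmer in index and index[kmer] != (name, label):
                    del index[kmer]
                    ambiguous.add(kmer)
                else:
                    index[kmer] = (name, label)
    return index


def map_barcodes(reads, index, k, min_fraction=0.5):
    """Step (a): build a barcode -> set of contig ends hash table."""
    barcode_to_ends = defaultdict(set)
    for barcode, seq in reads:
        read_kmers = kmers(seq, k)
        hits = defaultdict(int)
        for kmer in read_kmers:
            if kmer in index:
                hits[index[kmer]] += 1
        if hits:
            best_end, count = max(hits.items(), key=lambda item: item[1])
            if count / len(read_kmers) >= min_fraction:
                barcode_to_ends[barcode].add(best_end)
    return barcode_to_ends


def tally_pairs(barcode_to_ends):
    """Step (b): count barcodes shared by each pair of ends on different contigs."""
    counts = defaultdict(int)
    for ends in barcode_to_ends.values():
        for end_a, end_b in combinations(sorted(ends), 2):
            if end_a[0] != end_b[0]:
                counts[(end_a, end_b)] += 1
    return counts


def build_edges(counts, threshold=0):
    """Step (c): keep pairs with more than threshold shared barcodes."""
    return {pair: n for pair, n in counts.items() if n > threshold}


# Toy data (hypothetical): two contigs, three barcodes.
contigs = {"c1": "AACCGGTT" + "TTTT" + "CAGTCAGG",
           "c2": "GGATCCAT" + "AAAA" + "TGCATTGA"}
reads = [("BX1", "CAGTCAGG"), ("BX1", "GGATCCAT"),
         ("BX2", "CAGTCAGG"), ("BX2", "GGATCCAT"),
         ("BX3", "AACCGGTT")]
index = index_contig_ends(contigs, k=5, end_length=8)
edges = build_edges(tally_pairs(map_barcodes(reads, index, k=5)))
print(edges)  # one edge: tail of c1 to head of c2, supported by 2 barcodes
```

Because the lookup is a hash-table query per kmer, no read alignment or assembly indexing step is needed, which is the usability and speed argument made in the abstract.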

**Fig. 2**
Scaffolding a 10xG Supernova human genome assembly with ARKS. (a) Comparing the contiguity and accuracy of assemblies scaffolded by ARKS, ARCS, fragScaff and Architect as measured by QUAST. The baseline NA12878 Supernova assembly was scaffolded using ARCS (-c5 -s98 -m50-6000 -z3000 -e30000), ARKS (-c5 -k100 -t8 -j0.5 -m50-6000 -z3000 -e30000), fragScaff (-E30000 -j1 -u2) and Architect (--rc-abs-thr5 --rc-rel-edge-thr0.2, --rc-rel-prun-thr abbreviated to ‘P’). The Y-axes show the range of NGA50 to NG50 lengths to indicate the uncertainty caused by real genomic variations between individual NA12878 and the reference genome GRCh38. (b) A Circos [24] assembly consistency plot of ARKS (-k100 -j0.5 -c5 -e30000 -z3000 -m50-6000 -r0.05 -a0.5) scaffolding of the baseline NA12878 Supernova assembly. Scaftigs from the largest 123 scaffolds, consisting of 85% (NG85) of the genome, are aligned to GRCh38 with BWA mem [18]. GRCh38 chromosomes are displayed incrementally from 1 (bottom, red) to X (top, fuchsia) on the left while scaffolds (grey with black outlines) are displayed on the right side of the rim. Connections show aligned regions, 1 Mbp and larger, between the genome and scaffolds. Large-scale misassemblies are visible as cross-over ribbons. The black regions on chromosomes indicate reconstruction gaps in the reference. The majority of each chromosome is represented in the final ARKS assembly by no more than 13 assembly scaffolds.

**Fig. 3**
Chromosome-scale ARKS scaffolding of a NA12878 10xG Supernova assembly. Scaftigs from the NG85 scaffolds of the baseline NA12878 Supernova assembly (blue) and the assembly following ARKS scaffolding (-k100 -j0.5 -c5 -e30000 -z3000 -m50-6000 -r0.05 -a0.5) (green) were aligned to GRCh38 using BWA mem [18]. The ideogram was generated using the R package chromPlot [25]. For each assembly, alternating shades represent different scaffolds. Red bands on the reference chromosomes denote gaps in the reference assembly.

**Fig. 4**
Benchmarking wall clock time for ARCS and ARKS. The wall clock benchmarking is shown for the most contiguous assemblies from scaffolding the NA12878 Supernova assembly with ARCS (-a0.5 -s98 -c5 -m50-6000 -e30000 -z3000) and ARKS (-a0.5 -k100 -t8 -c5 -j0.5 -m50-6000 -e30000 -z3000). For ARCS, the linked reads were partitioned into eight sets, and aligned to the draft assembly in parallel. The wall clock time for the bottleneck read alignment is shown. ARKS was run with eight threads, while only the alignment step of ARCS was run with eight threads, as threading is not implemented in ARCS.

**Fig. 5**
Contiguity gains from scaffolding a NA24143 PacBio Falcon + HiRise human genome assembly with ARKS. The base Falcon (orange bar) and Falcon + HiRise (HR) (green) draft genomes were corrected with Tigmint (tg) [19] (w = 2000, n = 2), then scaffolded using ARKS (-a0.3 -c5 -k40 -j0.5 -e30000 -z3000 -t8 -m50-1000) (pink and blue bars). We also ran ARKS on the base Falcon (yellow) and Falcon + HiRise (turquoise) draft genomes without Tigmint correction.

Cited by

SWALO: scaffolding with assembly likelihood optimization.
Rahman A, Pachter L. Nucleic Acids Res. 2021 Nov 18;49(20):e117. doi: 10.1093/nar/gkab717. PMID: 34417615.
Identifying the causes and consequences of assembly gaps using a multiplatform genome assembly of a bird-of-paradise.
Peona V, Blom MPK, Xu L, Burri R, Sullivan S, Bunikis I, Liachko I, Haryoko T, Jønsson KA, Zhou Q, Irestedt M, Suh A. Mol Ecol Resour. 2021 Jan;21(1):263-286. doi: 10.1111/1755-0998.13252. Epub 2020 Oct 10. PMID: 32937018.
SpLitteR: diploid genome assembly using TELL-Seq linked-reads and assembly graphs.
Tolstoganov I, Chen Z, Pevzner P, Korobeynikov A. PeerJ. 2024 Sep 27;12:e18050. doi: 10.7717/peerj.18050. eCollection 2024. PMID: 39351368.
Fonio millet genome unlocks African orphan crop diversity for agriculture in a changing climate.
Abrouk M, Ahmed HI, Cubry P, Šimoníková D, Cauet S, Pailles Y, Bettgenhaeuser J, Gapa L, Scarcelli N, Couderc M, Zekraoui L, Kathiresan N, Čížková J, Hřibová E, Doležel J, Arribat S, Bergès H, Wieringa JJ, Gueye M, Kane NA, Leclerc C, Causse S, Vancoppenolle S, Billot C, Wicker T, Vigouroux Y, Barnaud A, Krattinger SG. Nat Commun. 2020 Sep 8;11(1):4488. doi: 10.1038/s41467-020-18329-4. PMID: 32901040.
A Highly Contiguous and Annotated Genome Assembly of the Lesser Prairie-Chicken (Tympanuchus pallidicinctus).
Black AN, Bondo KJ, Mularo A, Hernandez A, Yu Y, Stein CM, Gregory A, Fricke KA, Prendergast J, Sullins D, Haukos D, Whitson M, Grisham B, Lowe Z, DeWoody JA. Genome Biol Evol. 2023 Apr 6;15(4):evad043. doi: 10.1093/gbe/evad043. PMID: 36916502.

References

1. Zheng GX, Lau BT, Schnall-Levin M, Jarosz M, Bell JM, Hindson CM, et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol. 2016;34(3):303–311. doi: 10.1038/nbt.3432.
2. Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB. Direct determination of diploid genome sequences. Genome Res. 2017;27(5):757–767. doi: 10.1101/gr.214874.116.
3. Yeo S, Coombe L, Chu J, Warren RL, Birol I. ARCS: scaffolding genome drafts with linked reads. Bioinformatics. 2018;34(5):725–731. doi: 10.1093/bioinformatics/btx675.
4. Adey A, Kitzman JO, Burton JN, Daza R, Kumar A, Christiansen L, et al. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Res. 2014;24(12):2041–2049. doi: 10.1101/gr.178319.114.
5. Kuleshov V, Snyder MP, Batzoglou S. Genome assembly from synthetic long read clouds. Bioinformatics. 2016;32(12):i216–i224. doi: 10.1093/bioinformatics/btw267.

Grants and funding

R01 HG007182/HG/NHGRI NIH HHS/United States
