Leibniz Int Proc Inform. 2024:312:22.
doi: 10.4230/LIPIcs.WABI.2024.22. Epub 2024 Aug 26.

Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads

Xiaofei Carl Zang et al. Leibniz Int Proc Inform. 2024.

Abstract

Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule because they accurately determine its endpoints. One representative of such anchor-enabled technologies is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol. LoopSeq Solo also achieves ultra-high sequencing depth and high purity of short reads covering the entire captured molecule. Despite the availability of many assembly methods, constructing full-length sequences from such anchor-enabled, ultra-high-coverage sequencing data remains challenging due to the complexity of the underlying assembly graphs and the lack of algorithms that specifically leverage anchors. We present Anchorage, a novel assembler that performs anchor-guided assembly for ultra-high-depth sequencing data. Anchorage starts with a k-mer-based approach for precise estimation of molecule lengths. It then formulates the assembly problem as finding an optimal path that connects the two nodes determined by anchors in the underlying compact de Bruijn graph. Optimality is defined as maximizing the smallest node weight along the path while matching the estimated sequence length. Anchorage uses a modified dynamic programming algorithm to efficiently find the optimal path. Through both simulations and real data, we show that Anchorage outperforms existing assembly methods, particularly in the presence of sequencing artifacts. Anchorage fills the gap in assembling anchor-enabled data. We anticipate its broad use as anchor-enabled sequencing technologies become prevalent. Anchorage is freely available at https://github.com/Shao-Group/anchorage; the scripts and documents that can reproduce all experiments in this manuscript are available at https://github.com/Shao-Group/anchorage-test.

Keywords: Applied computing → Molecular sequence analysis; Genome assembly; LoopSeq; anchor-guided assembly; de Bruijn graph; synthetic long reads.
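
The path formulation in the abstract can be illustrated with a small sketch. This is not Anchorage's implementation; it is one possible length-indexed dynamic program under stated assumptions: each graph node carries a coverage weight and a positive integer number of bases it contributes to the assembled sequence, the walk may revisit nodes (repeats), and it stops at the first visit to the sink anchor node. The function name `best_anchored_path`, its parameters, and the tolerance `tol` are hypothetical names introduced for illustration only.

```python
from typing import Dict, List, Optional, Tuple


def best_anchored_path(
    succ: Dict[str, List[str]],   # adjacency: node -> successor nodes
    weight: Dict[str, int],       # node -> coverage weight (assumed given)
    length: Dict[str, int],       # node -> bases contributed (assumed >= 1)
    source: str,                  # node holding the start anchor
    sink: str,                    # node holding the end anchor
    target_len: int,              # estimated molecule length
    tol: int = 0,                 # hypothetical tolerance on the length
) -> Optional[Tuple[int, int, List[str]]]:
    """Return (bottleneck, total_length, path) or None if no walk fits."""
    if any(n <= 0 for n in length.values()):
        raise ValueError("node lengths must be positive integers")
    max_len = target_len + tol
    if length[source] > max_len:
        return None

    # best[L][v] = (largest achievable minimum weight, predecessor node)
    # over walks from source to v whose node lengths sum to exactly L.
    best: List[Dict[str, Tuple[int, Optional[str]]]] = [
        {} for _ in range(max_len + 1)
    ]
    best[length[source]][source] = (weight[source], None)

    # Lengths are positive, so level L is only written from levels < L
    # and is final by the time it is processed, even if the graph has cycles.
    for cur_len in range(max_len + 1):
        for u, (bottleneck, _) in list(best[cur_len].items()):
            if u == sink:
                continue  # assumption: the end anchor terminates the walk
            for v in succ.get(u, []):
                new_len = cur_len + length[v]
                if new_len > max_len:
                    continue
                new_bottleneck = min(bottleneck, weight[v])
                if v not in best[new_len] or new_bottleneck > best[new_len][v][0]:
                    best[new_len][v] = (new_bottleneck, u)

    # Pick the admissible length with the largest bottleneck,
    # breaking ties by closeness to the target length.
    choice: Optional[Tuple[int, int]] = None  # (bottleneck, end_len)
    for end_len in range(max(0, target_len - tol), max_len + 1):
        if sink in best[end_len]:
            b = best[end_len][sink][0]
            if (choice is None or b > choice[0] or
                    (b == choice[0] and
                     abs(end_len - target_len) < abs(choice[1] - target_len))):
                choice = (b, end_len)
    if choice is None:
        return None

    # Walk the predecessor pointers back to the source.
    path: List[str] = []
    v: Optional[str] = sink
    cur_len = choice[1]
    while v is not None:
        path.append(v)
        prev = best[cur_len][v][1]
        cur_len -= length[v]
        v = prev
    path.reverse()
    return choice[0], choice[1], path
```

On a toy graph with two branches between the anchors, a low-coverage branch `B` (weight 3) and a high-coverage branch `C` (weight 40), the sketch selects whichever branch yields the requested total length, and prefers the high-coverage branch when the tolerance admits both. Indexing the table by accumulated length is what lets the recursion stay well-founded when the graph contains cycles.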

Conflict of interest statement

K.M., T.B.-Y., and R.K. are current or former employees of Element Biosciences and may hold stock options in the company.

Figures

Figure 1. Comparison of assembly accuracy on real LoopSeq Solo sequencing datasets.
Anchorage, SPAdes, and MEGAHIT used all reads; SPAdes500 and MEGAHIT500 used 500 reads via random downsampling. The height of each bar represents the average value of each metric and the average value is labeled on each bar. Each dot represents the value of one assembly.
Figure 2. Comparison of assembly accuracy on simulated reads without artifacts.
Anchorage, SPAdes, and MEGAHIT used all reads; SPAdes500 and MEGAHIT500 used 500 reads via random downsampling. The height of each bar represents the average value of each metric and the average value is labeled on each bar. The whiskers in the GFP and LAR panels extend from the 25th to 75th percentile of values in each metric.
Figure 3. Comparison of assembly accuracy on simulated reads with read-throughs.
Anchorage, SPAdes, and MEGAHIT used all reads; SPAdes500 and MEGAHIT500 used 500 reads via random downsampling. The height of each bar represents the average value of each metric and the average value is labeled on each bar. The whiskers in the GFP and LAR panels extend from the 25th to 75th percentile of values in each metric.
Figure 4. Comparison of assembly accuracy on simulated reads with repetitive sequences.
Anchorage, SPAdes, and MEGAHIT used all reads; SPAdes500 and MEGAHIT500 used 500 reads via random downsampling. The height of each bar represents the average value of each metric and the average value is labeled on each bar. The whiskers in the GFP and LAR panels extend from the 25th to 75th percentile of values in each metric.
