Bioinformatics. 2022 Sep 2;38(17):4214-4216.
doi: 10.1093/bioinformatics/btac460.

Gfastats: conversion, evaluation and manipulation of genome sequences using assembly graphs


Giulio Formenti et al. Bioinformatics. 2022.

Abstract

Motivation: With the current pace at which reference genomes are being produced, the availability of tools that can reliably and efficiently generate genome assembly summary statistics has become critical. Additionally, with the emergence of new algorithms and data types, tools that can improve the quality of existing assemblies through automated and manual curation are required.

Results: We sought to address both these needs by developing gfastats, as part of the Vertebrate Genomes Project (VGP) effort to generate high-quality reference genomes at scale. Gfastats is a standalone tool to compute assembly summary statistics and manipulate assembly sequences in FASTA, FASTQ or GFA [.gz] format. Gfastats stores assembly sequences internally in a GFA-like format. This feature allows gfastats to seamlessly convert FAST* to and from GFA [.gz] files. Gfastats can also build an assembly graph that can in turn be used to manipulate the underlying sequences following instructions provided by the user, while simultaneously generating key metrics for the new sequences.
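The two capabilities described above, summary statistics and FAST* to GFA conversion, can be illustrated with a minimal Python sketch. This is not gfastats' implementation (which is in C++) and does not reproduce its output format; the function names `read_fasta`, `n50` and `to_gfa_segments` and the toy input are hypothetical, chosen only for illustration. The sketch assumes plain, uncompressed FASTA text and emits GFA version 1 segment (S) lines.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier, dropping any free-text description.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Length of the sequence at which the cumulative sum, taken from the
    longest sequence downwards, first reaches half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def to_gfa_segments(records):
    """Render each record as a GFA1 segment (S) line after a header line."""
    lines = ["H\tVN:Z:1.0"]
    lines += [f"S\t{name}\t{seq}" for name, seq in records]
    return "\n".join(lines)


# Toy assembly: three sequences of 8, 3 and 1 bp.
example = ">seq1 demo\nACGTAC\nGT\n>seq2\nACG\n>seq3\nA\n"
records = list(read_fasta(StringIO(example)))
lengths = [len(seq) for _, seq in records]

summary = {"sequences": len(records), "total_bp": sum(lengths), "n50": n50(lengths)}
gfa_text = to_gfa_segments(records)

print(summary)   # {'sequences': 3, 'total_bp': 12, 'n50': 8}
print(gfa_text)
```

Holding the sequences as named segments is what makes the conversion in both directions straightforward: a FASTA record maps to one S line here, and the reverse direction only needs to read the S lines back.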

Availability and implementation: Gfastats is implemented in C++. Precompiled releases (Linux, macOS, Windows) and commented source code for gfastats are available under the MIT licence at https://github.com/vgl-hub/gfastats. Examples of how to run gfastats are provided in the GitHub repository. Gfastats is also available in Bioconda, in Galaxy (https://assembly.usegalaxy.eu) and as a MultiQC module (https://github.com/ewels/MultiQC). An automated test workflow is available to ensure consistency of software updates.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
(a) Schematic of gfastats workflow. Inputs (top trapezoids) include genome assemblies in FASTA, FASTQ, GFA [.gz] formats and include/exclude lists as bed coordinate files for filtering (first diamond). These are represented internally by multiple C++ classes including their constituent elements (rectangles). The assembly can be converted to a graph (first oval), to ease manipulation by the internal Swiss Army Knife (SAK; second diamond), and then summary statistics are computed (second oval). A variety of outputs can be generated, such as summary statistics and new sequences in *fa* format. (b) Internal bidirected graph representation of the input sequences. Segments (nodes A, B, C) are connected by forward (b, c, e, g) or backward (a, d, f, h) gap edges. Terminal nodes can optionally be associated with gaps (dashed lines, gap edges a, b, g and h). An assembly scaffold is a path in the graph (e.g. A → c → B → e → C, grey middle line). Sequence manipulation can be achieved using the internal SAK. For instance, the given path could be split by removing gap edges c and d that connect segment nodes A and B, leading to a disconnected node A, and two connected nodes B and C linked by edges e and f (portion of the path in light grey removed). Overlap edges can be treated in the same way. (c) Evaluation of gfastats runtime. Runtime is a function of genome size and increases linearly with it. There is a small increase in time when handling gzip-compressed files.
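The split described in panel (b) can be sketched in a few lines of Python. This is an illustrative simplification, not the gfastats data model: each forward/backward pair of gap edges (such as c and d) is collapsed into a single `Gap` record, terminal gaps are omitted, and the names `Gap`, `scaffolds`, `split_between` and `scaffold_sequence`, as well as the sequences and gap lengths, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    """One gap joining two segments (stands in for a forward/backward edge pair)."""
    left: str
    right: str
    length: int


# Toy graph: segments A, B, C joined into a single scaffold A -> B -> C.
segments = {"A": "ACGT", "B": "GGCC", "C": "TTAA"}
gaps = [Gap("A", "B", 10), Gap("B", "C", 5)]


def scaffolds(segments, gaps):
    """Return scaffolds as lists of segment names by following gaps left to right."""
    next_of = {g.left: g.right for g in gaps}
    has_predecessor = {g.right for g in gaps}
    paths = []
    for start in segments:
        if start in has_predecessor:
            continue  # not the first segment of a scaffold
        path = [start]
        while path[-1] in next_of:
            path.append(next_of[path[-1]])
        paths.append(path)
    return paths


def split_between(gaps, left, right):
    """Remove the gap joining `left` and `right`, disconnecting the path there."""
    return [g for g in gaps if (g.left, g.right) != (left, right)]


def scaffold_sequence(path, segments, gaps):
    """Concatenate the segments of a path, filling each gap with N characters."""
    gap_len = {(g.left, g.right): g.length for g in gaps}
    parts = [segments[path[0]]]
    for prev, cur in zip(path, path[1:]):
        parts.append("N" * gap_len[(prev, cur)])
        parts.append(segments[cur])
    return "".join(parts)


before = scaffolds(segments, gaps)
gaps_after = split_between(gaps, "A", "B")
after = scaffolds(segments, gaps_after)

print(before)  # [['A', 'B', 'C']]
print(after)   # [['A'], ['B', 'C']]
print(scaffold_sequence(after[1], segments, gaps_after))  # GGCCNNNNNTTAA
```

Because the segments themselves are untouched, metrics for the new scaffolds can be recomputed directly from the edited edge list, which mirrors the idea of manipulating sequences while generating updated statistics.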

