Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
[Preprint]. 2023 Jun 30:2023.06.28.546576.
doi: 10.1101/2023.06.28.546576.

Scalable, accessible, and reproducible reference genome assembly and evaluation in Galaxy

Affiliations

Scalable, accessible, and reproducible reference genome assembly and evaluation in Galaxy

Delphine Larivière et al. bioRxiv. .

Update in

  • Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy.
    Larivière D, Abueg L, Brajuka N, Gallardo-Alba C, Grüning B, Ko BJ, Ostrovsky A, Palmada-Flores M, Pickett BD, Rabbani K, Antunes A, Balacco JR, Chaisson MJP, Cheng H, Collins J, Couture M, Denisova A, Fedrigo O, Gallo GR, Giani AM, Gooder GM, Horan K, Jain N, Johnson C, Kim H, Lee C, Marques-Bonet T, O'Toole B, Rhie A, Secomandi S, Sozzoni M, Tilley T, Uliano-Silva M, van den Beek M, Williams RW, Waterhouse RM, Phillippy AM, Jarvis ED, Schatz MC, Nekrutenko A, Formenti G. Larivière D, et al. Nat Biotechnol. 2024 Mar;42(3):367-370. doi: 10.1038/s41587-023-02100-3. Nat Biotechnol. 2024. PMID: 38278971 Free PMC article. No abstract available.

Abstract

Improvements in genome sequencing and assembly are enabling high-quality reference genomes for all species. However, the assembly process is still laborious, computationally and technically demanding, lacks standards for reproducibility, and is not readily scalable. Here we present the latest Vertebrate Genomes Project assembly pipeline and demonstrate that it delivers high-quality reference genomes at scale across a set of vertebrate species arising over the last ~500 million years. The pipeline is versatile and combines PacBio HiFi long-reads and Hi-C-based haplotype phasing in a new graph-based paradigm. Standardized quality control is performed automatically to troubleshoot assembly issues and assess biological complexities. We make the pipeline freely accessible through Galaxy, accommodating researchers even without local computational resources and enhanced reproducibility by democratizing the training and assembly process. We demonstrate the flexibility and reliability of the pipeline by assembling reference genomes for 51 vertebrate species from major taxonomic groups (fish, amphibians, reptiles, birds, and mammals).

Keywords: Genome assembly; accessible; large genomes; modularity; opensource; public; reproducibility; scalable.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Assembly workflows, analysis trajectories, and quality metrics. A. Eight analysis trajectories are possible depending on the combination of input data. Decision on invocation of workflow 6 is based on the analysis of QC output of workflows 3, 4, or 5 (see Supplemental data for full explanation). Thicker lines connecting workflows 7, 8, and 9 represent the fact that these workflows are invoked separately for each phased assembly (once for maternal and once for paternal). B. Schematic of assembly graph propagation through the assembly pipeline using the GFA format. Initially, PacBio contigs are generated in GFA format. The scaffolding information from Bionano and/or Hi-C is added to the graph through AGP intermediates . Manual curation can be integrated in the graph using AGP files. The final assembly is a collection of segments and unresolved overlaps from the original graph, with gaps and paths representing the scaffolding information. Paths can be converted to linear FASTA sequences for downstream analyses. During the scaffolding process gaps (green arrows) are added as jump (J) lines between the segments (blue arrows) to the GFA. This allows the information on unresolved overlaps (purple arrows) to be maintained while missed joins (dashed arrows) inferred from the scaffolding information are added to the graph. The final linear sequences are represented as paths in the graph (yellow highlight). C. QC generated by the pipeline. Zebra finch trio data supplemented with Bionano and Hi-C data is shown. i. GenomeScope profiles using 21-mers on child HiFi data and parental Illumina data. ii. Merqury profiles (upper three graphs) and gfastats continuity metrics (lower two graphs) following contiging. The CN plot shows the number of copies of k-mers in both assemblies. Single copy k-mers correspond to heterozygous regions, and two copies to homozygous regions. This plot provides information about the potential need to purge the assembly if it shows the presence of k-mers in three or more copies. The ASM plot shows that the paternal and maternal assemblies share k-mers at ~40× range that corresponds to the diploid coverage (also indicated by the rightmost peak in the child GenomeScope profile in panel). This is consistent with the expectation that phased assemblies will split the homozygous regions of the genome between the two haplotypes. Accordingly, heterozygous content in the genome is split mostly evenly between the two haplotypes, shown by the similar-sized assembly-specific peaks at ~20× (the maternal peak is slightly bigger due to the presence of the Z chromosome, which is larger than the W in the paternal assembly). The assembly blob plot (rightmost graph in the top row of subpanel ii indicates excellent stratification of hapmers (k-mers specific to a particular haplotype) across maternal (red) and paternal (blue) assemblies. iii. Pretext map of maternal assembly after Hi-C scaffolding shows increase in continuity compared with the panel. iv. General statistics and Pretext map of the maternal scaffolding after a second scaffolding step using Hi-C data. The map shows improvements compared to the previous scaffolding step, with fewer scaffolds and less physical proximity between scaffolds (non-diagonal Hi-C signal).
Figure 2
Figure 2
Extended evaluation of HiFi zebra finch assemblies using the three modes of the VGP pipeline in Galaxy: “Solo” (pseudohaplotype primary/alternate), “Solo w/Hi-C” (Hi-C-based phasing), and “Trio” (phasing using parental reads). A. shows 21-mers at 2-copy in the assembly with k-mer multiplicity between 50 and 90, a range that includes the expected diploid coverage region. B. shows k-mers that are present in only the reads, which suggests that the regions they represent (about 8 Mbp of sequence that ended up re-binned) are missing from the assembly, relative to the assembly in question. The purged Solo primary assembly has some missing regions, likely due to imperfect purging (k-mer multiplicity <5 excluded). C. Comparisons of k-mer duplication (red) and completeness (blue) between default and rebinned trio assemblies in males (left) and females (right). D. shows contigs from each assembly plotted according to how many parental hapmers are present in the contig, with contigs that were either fully phased (>95% of either parental k-mer) or lacking informative phasing information (i.e., less than 50 paternal and less than 50 maternal k-mers) excluded. The size of each bubble is proportional to the total k-mer size of the contig. Contigs along the diagonal have a mixed representation of hapmers from both parents, indicating intra-contig switch errors. Of these contigs, the Solo ones are typically larger and contain a higher amount of hapmers from both parents. E. Proportion of k-mer expansion and collapse in each diploid bTaeGut2 assembly. F. Proportion of k-mer duplication in the bTaeGut2 assemblies. We calculated k-mer duplications from the primary assemblies (CLR, Solo, Solo w/Hi-C) and paternal assembly (Trio) from phased diploid assemblies. G. Proportion and cumulated size (in Mb) of false losses of each assembly (above), and heat map of the size (in Mb) of false losses identified between the assemblies (below) in log scale. H. Proportion and cumulated size (in Mb) of false duplications of each assembly. I. A case of potential false gene gain in CLR assembly. Duplications of homologous sequences of partial ITSN1 gene was found in CLR assembly. Read depth coverage of contigs including the homologous sequences of ITSN1 gene in each bTaeGut2 assembly (highlighted in grey) is shown with a range from 0 to 200. The number in the gray highlighted region represents a mean depth coverage of ITSN1 homologous regions in each assembly.
Figure 3
Figure 3
Phylogenetic tree and assembly statistics of genomes assembled using the VGP assembly pipeline in Galaxy. From the most inner circle to the most outer circle: (i) Repeat content, (ii) Heterozygosity, (iii) Individual with two identical sexual chromosomes (white) or two different sexual chromosomes (blue), (iv) Assembly size in percentage of the genome size estimated by genomescope, (v) Scaffold NG50 in % of estimated genome size, (vi) Merqury completeness of both haplotypes, (vii), BUSCO completeness: presence of orthologous genes present and complete compared to the set expected in vertebrates, (viii) Mitogenome assembled and available (black), (ix) Genome Size in Gb, with lines at 9, 2, 3, 4, 6, and 8 GB, (x) Number of Scaffolds in log scale with lines at 1 (10 scaffolds), 2 (100 scaffolds), 3 (1,000 scaffolds), and 4 (10,000 scaffolds).
Figure 4
Figure 4
Comparison of assembly statistics between Sequencing technologies and scaffolding modes. Panels A to F compare assembly statistics between HiFi technology in black, used in this study, and CLR technology in grey, used in the previous version of the VGP assembly pipeline. Each dot represents an assembled species (Primary or Hap1), and the lines represent the linear regression for each technology. A. Scaffold NG50 in Mb in relation to Genome size in Gb. B. Contigs NG50 in relation to Repeat content, C. Gaps per GB in relation with Repeat content, D. Size of the Primary assembly in percent of the estimated genome size in relation with the heterozygosity rate of the genome, E. Genome completeness estimated by Merqury (both haplotypes together) in relation with the heterozygosity rate, F. Genome completeness in relation with repeat content.
Figure 5
Figure 5
Decontamination pipeline. A. Comparison of foreign contaminants classified by VGP Decontamination Pipeline to the ground truth set. Percent of ground truth contaminant scaffolds that were true positive, false negative and false positive classifications, relative to percent of bases these scaffolds represent from the whole assembly. B. Species of contaminants identified by decontamination. Blast results of the foreign contaminants including coverage from the alignment and the size of the contaminant sequence. 323 of the 325 contaminant sequences identified are represented here; the two excluded contaminants were identified as killer whale and are false classifications. Two hundred sixty five of the contaminant sequences are synthetic constructs. The points are jittered to capture density.
Figure 6
Figure 6
Duplication of XBP-1 locus in Cynocephalus volans and comparison to the human locus. a. The structure of the human XBP-1 gene. Exon 4 contains a 26 bp spacer (yellow) that is excised from mature mRNA by endonuclease IRE1α. When the transcript is not cleaved, the green reading frame is translated. When the spacer is removed the reading frame switches to orange downstream of the cleavage site. b. A nucleotide-level representation of exon 4 and corresponding translations for human and two flying lemur copies (Copy 1 and Copy 2). Single letter amino acid identifiers are centered at the second codon position. Red amino acids indicate sites with nucleotide changes. Two amino acids separated by “/” indicate amino acid replacement from one preceding slash to the trailing one: Q/R = change from Q to R. Single red amino acid indicates no change (synonymous substitution). Two stem-loop structures are critical secondary structure elements of the IRE1α cleavage site. c. Structure of two XBP-1 loci in Cynocephalus volans. Top panel shows the relative position of the two copies within scaffold 2. Copy 1 retains exon/intron structure identical to that of the human gene. Copy 2 lacks introns completely. Arrows indicate all possible reading frames (STOP-to-STOP) in the vicinity of each exon. Red = + strand; Blue = − strand.

References

    1. Hotaling S., Kelley J. L. & Frandsen P. B. Toward a genome sequence for every animal: Where are we now? Proc. Natl. Acad. Sci. U. S. A. 118, (2021). - PMC - PubMed
    1. Formenti G. et al. The era of reference genomes in conservation genomics. Trends Ecol. Evol. 37, 197–202 (2022). - PubMed
    1. Theissinger K. et al. How genomics can help biodiversity conservation. Trends Genet. (2023) doi:10.1016/j.tig.2023.01.005. - DOI - PubMed
    1. Lewin H. A. et al. The Earth BioGenome Project 2020: Starting the clock. Proc. Natl. Acad. Sci. U. S. A. 119, (2022). - PMC - PubMed
    1. Giani A. M., Gallo G. R., Gianfranceschi L. & Formenti G. Long walk to genomics: History and current approaches to genome sequencing and assembly. Comput. Struct. Biotechnol. J. 18, 9–19 (2020). - PMC - PubMed

Publication types