Wellcome Open Res. 2023 Jan 17:8:24.
doi: 10.12688/wellcomeopenres.18658.1. eCollection 2023.

Genomes on a Tree (GoaT): A versatile, scalable search engine for genomic and sequencing project metadata across the eukaryotic tree of life


Richard Challis et al. Wellcome Open Res. 2023.

Abstract

As genomic data transform our understanding of biodiversity, the Earth BioGenome Project (EBP) has set a goal of generating reference quality genome assemblies for all ~1.9 million described eukaryotic taxa. Meeting this goal requires coordination among many individual regional and taxon-focussed projects working under the EBP umbrella. Large-scale sequencing projects require ready access to validated genome-relevant metadata, such as genome sizes and karyotypes, but these data are dispersed across the literature, and directly measured values are lacking for most taxa. To meet these needs, we have developed Genomes on a Tree (GoaT), an Elasticsearch-powered datastore and search index for genome-relevant metadata and sequencing project plans and statuses. GoaT indexes publicly available metadata for all eukaryotic species and interpolates missing values through phylogenetic comparison. GoaT also holds target priority and sequencing status information for many projects affiliated to the EBP to aid project coordination. Metadata and status attributes in GoaT can be queried through a mature API, a web front end, and a command line interface. The web front end additionally provides summary visualisations for data exploration and reporting (see https://goat.genomehubs.org). GoaT currently holds direct or estimated values for over 70 taxon attributes and over 30 assembly attributes across 1.5 million eukaryotic species. The depth and breadth of curated data, frequent updates, and a versatile query interface make GoaT a powerful data aggregator and portal to explore and report underlying data for the eukaryotic tree of life. We illustrate this utility through a series of use cases from planning through to completion of a genome-sequencing project.

Keywords: Genomics; Earth BioGenome Project; Tree of Life; Databasing; Elasticsearch.

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. Schematic of the curation process.
GoaT data curation can be divided into two phases: (A) retrieval of taxon and assembly metadata in multiple formats from different sources; (B) standardisation of metadata to tabular format and preparation of the corresponding import specification file. During preparation of TSV and YAML file pairs, curation includes normalisation of values, translation of attribute terms, mapping of columns to existing GoaT attributes, and defining the method for propagation of estimates.
Figure 2. Examples of the Configuration-as-Code (CaC) paradigm.
(A) Extracts from a YAML file describing the Kew Plant C-values Database tabular data source, which specifies the file to be imported, the attributes it fills, which taxa each row refers to, and how the values in the table should be transformed (link A). (B) Extracts from a YAML file describing INSDC assembly attributes in general, how they should be processed and propagated, and how they should be displayed (link B).
Figure 3. GenomeHubs data import processes.
Pairs of data and metadata files (in TSV and YAML formats, respectively) are stored in a separate goat-data GitHub repository. The GenomeHubs init, index, and fill commands use these to create Elasticsearch indexes.
Figure 4. Taxonomically-informed inference of missing values.
Directly measured attribute values (green) are used to infer estimated values for parent nodes (orange). These estimated values are then used to fill any unknown descendant node values (red).
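
The two-pass inference described in Figure 4 can be sketched roughly as follows. This is a minimal illustration, not GoaT's implementation: the `Node` class, the function names, and the use of the median as the summary statistic are all assumptions made for the example (in GoaT the propagation method is defined per attribute during curation).

```python
from statistics import median


class Node:
    """Hypothetical taxon node; not a GoaT data structure."""

    def __init__(self, name, value=None, children=None):
        self.name = name
        self.value = value              # directly measured value, or None
        self.children = children or []
        self.estimate = None            # inferred value, if any
        self.source = "direct" if value is not None else None


def summarise_up(node):
    """Post-order pass: estimate unmeasured nodes from descendant values."""
    values = []
    for child in node.children:
        values.extend(summarise_up(child))
    if node.value is not None:
        values.append(node.value)
    elif values:
        # Assumption: the median summarises descendant measurements.
        node.estimate = median(values)
        node.source = "descendant"
    return values


def fill_down(node, inherited=None):
    """Pre-order pass: give still-unknown nodes the nearest ancestor value."""
    current = node.value if node.value is not None else node.estimate
    if current is None and inherited is not None:
        node.estimate = inherited
        node.source = "ancestor"
        current = inherited
    for child in node.children:
        fill_down(child, current)


# Hypothetical genus with two measured species and one unmeasured species.
sp_a = Node("species A", value=100)
sp_b = Node("species B", value=300)
sp_c = Node("species C")
genus = Node("genus", children=[sp_a, sp_b, sp_c])

summarise_up(genus)
fill_down(genus)
```

Here the genus receives an estimate from its measured species, and the unmeasured species then inherits that estimate, with `source` recording whether a value was measured directly, derived from descendants, or filled from an ancestor.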
Figure 5. Primary GenomeHubs API/UI endpoints and UI mapping.
The autocomplete, search result table, taxon record and report components in the UI map directly to the /lookup, /search, /record and /report API endpoints, respectively.
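
As a rough sketch of the mapping in Figure 5, the snippet below builds request URLs for the four endpoints. The endpoint paths come from the caption; the base URL (`/api/v2` under the GoaT site) and the query parameter names (`query`, `result`) are assumptions for illustration and should be checked against the API documentation. The helper `endpoint_url` is hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

# Assumed base path; only the site https://goat.genomehubs.org is given in the text.
BASE_URL = "https://goat.genomehubs.org/api/v2"

# UI component -> API endpoint, as listed in the Figure 5 caption.
ENDPOINTS = {
    "autocomplete": "/lookup",
    "search result table": "/search",
    "taxon record": "/record",
    "report": "/report",
}


def endpoint_url(component, **params):
    """Build a URL for the endpoint behind a UI component (hypothetical helper)."""
    url = BASE_URL + ENDPOINTS[component]
    query_string = urlencode(params)  # percent-encodes parentheses, spaces, etc.
    return f"{url}?{query_string}" if query_string else url


# Parameter names here are assumptions, not confirmed by the text.
search_url = endpoint_url(
    "search result table", query="tax_rank(species)", result="taxon"
)
print(search_url)
```

Keeping the component-to-endpoint table in one dictionary mirrors the one-to-one mapping the caption describes, so a UI view and its API call can be derived from the same query.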
Figure 6. Numbers of taxa and assemblies in GoaT.
(A) Number of taxa with publicly available assemblies out of all taxa at ranks from phylum to species (link A). (B) Number of chromosomal or complete genome assemblies out of all assemblies (link B).
Figure 7. Summary of available chromosomal assemblies for species on the DToL long_list.
(A) GoaT UI search result table for the query “tax_rank(species) AND long_list=DTOL AND assembly_level=chromosome AND bioproject!=PRJEB40665” (link A). (B) Tree report for the same search, highlighting assemblies that meet the EBP threshold of “contig_n50>1000000 AND scaffold_n50>10000000” (link B). Green highlights indicate directly measured values while orange highlights show information derived from a descendant taxon. Tree reports are interactive and taxa can be displayed on tooltips, expanded into subtrees by long-pressing, or redirected by short-clicking to their respective records page.
Figure 8. Numbers of DToL target species that are also included on the target list of another EBP-affiliate project.
(A) Total number of DToL target species included on any other target list (link A). (B) Number of DToL target species included on a priority list of another project (link B). (C) Number of DToL target species listed as a family representative by another project, but not by DToL (link C).
Figure 9. GoaT search result table for the query “tax_rank(species) AND long_list=dtol AND sequencing_status” showing sequencing status columns for species on the DToL target list (link).
Figure 10. Tree reports showing the distribution of sequencing effort in Arthropoda.
Activity summarised (A) by class (link A) and (B) by family (link B). Taxa with publicly available genome assemblies for any descendant taxon have an orange highlight and those without have a red highlight. The interactive versions of these plots show tooltips on mouseover to display taxon names for arcs that are too small to accommodate a taxon name in the default display.
Figure 11. Examples of strategies to remove outliers from target lists.
(A) Using the query builder to refine a query to exclude species with large chromosome size and chromosome number above a threshold of 34, and with ploidy greater than 4n (link A). (B) Exploratory search of Allium species with ploidy varying from 2n to 8n (link B). Taxa with the lowest values for chromosome number and genome size could be selected as genus representatives.
Figure 12. Distribution of genome size values across species in GoaT.
(A) Directly measured values only (link A). (B) Directly measured and estimated values (link B).
Figure 13. A taxon record page for Euchroma gigantea, NCBI Taxon ID 1580703 (link).
This beetle has an unusual sex chromosome system (XXXYY; XXXYYY) and knowledge of this feature aids in resolution of assembly issues.
Figure 14. Tree report highlighting the distribution of BUSCO completeness scores among species in the bat family Pteropodidae with publicly available genome assemblies (link).
Figure 15. Tree report of phyla on the DToL long list showing representative assembly span of each phylum (link).
Orange highlights show phyla with at least one assembly released under the DToL BioProject (PRJEB40665). Red highlights show phyla with no publicly available assemblies. Taxa in grey have at least one publicly available assembly but none under the DToL BioProject.
Figure 16. Scatter report showing contiguity assessment of DToL genomes released under BioProject PRJEB40665.
The EBP metric zone highlights assemblies with a contig N50 > 1 Mb and a scaffold N50 > 10 Mb. (A) Points coloured by assembly type, with primary haploid assemblies shown in green and alternate haplotypes in orange (link A). (B) Points coloured by assembly level, with chromosomal assemblies shown in green (link B).
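
The EBP metric zone in Figure 16 amounts to a pair of threshold checks, sketched below under stated assumptions. The thresholds (contig N50 > 1 Mb, scaffold N50 > 10 Mb) and the field names `contig_n50` and `scaffold_n50` are taken from the captions; the function name, the record layout, and the example values are hypothetical and are not real assembly data.

```python
EBP_MIN_CONTIG_N50 = 1_000_000      # 1 Mb, from the figure caption
EBP_MIN_SCAFFOLD_N50 = 10_000_000   # 10 Mb, from the figure caption


def meets_ebp_metric(assembly):
    """Return True if both N50 values exceed the EBP thresholds.

    Assemblies with a missing value are treated as not meeting the
    metric (an assumption made for this sketch).
    """
    contig = assembly.get("contig_n50")
    scaffold = assembly.get("scaffold_n50")
    if contig is None or scaffold is None:
        return False
    return contig > EBP_MIN_CONTIG_N50 and scaffold > EBP_MIN_SCAFFOLD_N50


# Made-up records for illustration only.
assemblies = [
    {"id": "example_1", "contig_n50": 2_500_000, "scaffold_n50": 45_000_000},
    {"id": "example_2", "contig_n50": 800_000, "scaffold_n50": 30_000_000},
    {"id": "example_3", "contig_n50": 1_200_000},
]

in_zone = [a["id"] for a in assemblies if meets_ebp_metric(a)]
print(in_zone)
```

Both comparisons are strict, matching the ">" in the caption, so an assembly sitting exactly on a threshold falls outside the zone.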
Figure 17. Exploration of plastid and mitochondrial genome characteristics.
Scatter reports showing relationships between (A) GC content (link A) and (B) assembly span (link B) for plant mitochondrial and plastid genome assemblies for all 99 Viridiplantae species in INSDC that have both organellar genomes present.
Figure 18. Scatter reports showing the relationship between assembly span and gene count by kingdom.
(A) All 3,078 assemblies (link A) and (B) only the 726 chromosome-level assemblies (link B).

