NCBI's Virus Discovery Codeathon: Building "FIVE" -The Federated Index of Viral Experiments API Index

Joan Martí-Carreras¹, Alejandro Rafael Gener²,³,⁴,⁵, Sierra D Miller⁶, Anderson F Brito⁷, Christiam E Camacho⁸, Ryan Connor⁸, Ward Deboutte¹, Cody Glickman¹, David M Kristensen⁹, Wynn K Meyer¹⁰, Sejal Modha¹¹, Alexis L Norris¹², Surya Saha¹³,¹⁴, Anna K Belford¹⁵, Evan Biederstedt¹⁶, James Rodney Brister⁸, Jan P Buchmann¹⁷, Nicholas P Cooley¹⁸, Robert A Edwards¹⁹, Kiran Javkar²⁰,²¹, Michael Muchow²², Harihara Subrahmaniam Muralidharan²⁰,²³, Charles Pepe-Ranney²⁴, Nidhi Shah²⁰, Migun Shakya²⁵, Michael J Tisza¹⁵, Benjamin J Tully²⁶, Bert Vanmechelen¹, Valerie C Virta²⁷, J L Weissman²⁸, Vadim Zalunin⁸, Alexandre Efremov⁸, Ben Busby⁸,²⁹

Affiliations

¹ Laboratory of Clinical and Epidemiological Virology, KU Leuven Department of Microbiology, Immunology and Transplantation, Rega Institute, Leuven BE3000, Belgium.
² Integrative Molecular and Biomedical Sciences Program, Baylor College of Medicine, Houston, TX 77030, USA.
³ Margaret M. and Albert B. Alkek Department of Medicine, Nephrology, Baylor College of Medicine, Houston, TX 77030, USA.
⁴ Department of Genetics, MD Anderson Cancer Center, Houston, TX 77030, USA.
⁵ School of Medicine, Universidad Central del Caribe, Bayamón, Puerto Rico 00960, USA.
⁶ Genetics & Molecular Biology, Millersville University, 40 Dilworth Rd, Millersville, PA 17551, USA.
⁷ Department of Epidemiology of Microbial Diseases, Yale School of Public Health (YSPH), 60 College Street, New Haven, CT 06510, USA.
⁸ National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health, 9000 Rockville Pike, Bethesda, MD 20894, USA.
⁹ Computational Bioscience Program, University of Colorado Anschutz, Aurora, CO 80045, USA.
¹⁰ AAAS Science and Technology Policy Fellow, Office of Data Science Strategy, Division of Program Coordination, Planning, and Strategic Initiatives, Office of the Director, National Institutes of Health, 31 Center Dr., Bethesda, MD 20894, USA.
¹¹ MRC-University of Glasgow Centre for Virus Research, G61 1QH Glasgow, UK.
¹² Biotechnology Graduate Program, University of Maryland Global Campus, 1616 McCormick Drive, Largo, MD 20774, USA.
¹³ Boyce Thompson Institute, Ithaca, NY 14850, USA.
¹⁴ School of Animal and Comparative Biomedical Sciences, The University of Arizona, Tucson, AZ 85721, USA.
¹⁵ Laboratory of Cellular Oncology, National Cancer Institute, 37 Convent Dr., Bethesda, MD 20894, USA.
¹⁶ Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA.
¹⁷ School of Life and Environmental Sciences and School of Medical Sciences, Marie Bashir Institute for Infectious Diseases and Biosecurity, The University of Sydney, Sydney, Australia.
¹⁸ Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260, USA.
¹⁹ College of Science and Engineering, Flinders University, Bedford Park, SA 5042, Australia.
²⁰ Department of Computer Science, University of Maryland, College Park, MD 20740, USA.
²¹ Joint Institute for Food Safety and Applied Nutrition, University of Maryland, College Park, MD 20740, USA.
²² Novel Microdevices, Nucleic Acids, Baltimore, MD 21202, USA.
²³ Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20740, USA.
²⁴ AgBiome, 104 TW Alexander, Research Triangle, NC 27709, USA.
²⁵ Bioscience Division, Bikini Atoll Road, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
²⁶ Center for Dark Energy Biosphere Investigations, University of Southern California, Los Angeles, CA 90089, USA.
²⁷ AAAS Science & Technology Policy Fellow, National Institutes of Health, Center for Information Technology, 6555 Rock Spring Drive, Bethesda, MD 20817, USA.
²⁸ Department of Marine and Environmental Biology, University of Southern California, Los Angeles, CA 90089, USA.
²⁹ DNANexus, 1975 W El Camino Real #204, Mountain View, CA 94040, USA.

PMID: 33322070
PMCID: PMC7764237
DOI: 10.3390/v12121424

Joan Martí-Carreras et al. Viruses. 2020 Dec 10;12(12):1424. doi: 10.3390/v12121424.

Abstract

Viruses represent important test cases for data federation due to their genome size and the rapid increase in sequence data in publicly available databases. However, one consequence of previously decentralized (unfederated) data is a lack of consensus on, or comparison between, feature annotations. Unifying or displaying alternative annotations should be a priority both for communities with robust entry representation and for nascent communities with burgeoning data sources. To this end, during this three-day continuation of the Virus Hunting Toolkit codeathon series (VHT-2), a new integrated and federated viral index was developed. This Federated Index of Viral Experiments (FIVE) integrates pre-existing and novel functional and taxonomy annotations and virus-host pairings. Variability in the context of viral genomic diversity is often overlooked in virus databases. As a proof of concept, FIVE was the first attempt to include viral genome variation for HIV, the most well-studied human pathogen, through viral genome diversity graphs. As of the publication of this manuscript, FIVE is the first implementation of a virus-specific federated index of such scope. FIVE is coded in BigQuery for efficient access to large quantities of data and is publicly accessible. Many database or index federation projects fail to provide easier ways to access or query information. To address this, a Python API query system was developed to enhance the accessibility of FIVE.

Keywords: CRISPR; HIV-1; data federation; genome graphs; metagenomics; protein domain; virus.

Conflict of interest statement

J.M.C. and A.R.G. have received travel awards and bursaries from Oxford Nanopore Technologies, Oxford, UK. This material should not be interpreted as representing the viewpoint of the U.S. Department of Health and Human Services, the National Institutes of Health, the Food and Drug Administration, the National Library of Medicine, the National Center for Biotechnology Information, or the Center for Information Technology. The authors have no other competing interests to disclose.

Figures

**Figure 1**
Protein Domain Recognition Pipeline. Using 2082 entries from CDD (Conserved Domains Database) domain models in PSSM (Position-Specific Scoring Matrix) format, we tested two pipelines: RPS-BLAST and Mash. RPS-BLAST, with known domain models matched against assembled contigs, is accurate but computationally expensive. The Mash pipeline, which is significantly faster and can be applied directly on unassembled reads, was also tested.

**Figure 2**
A schematic representation of Federated Index of Viral Experiments (FIVE) implementation, and interactions with users, enabled through the viral-index Application Programming Interface (API). Viral information generated in both codeathons is indexed in BigQuery on FIVE, accessible from Google Cloud, which can be easily queried using the viral-index API [48]. This API enables users to perform a range of flexible searches on the FIVE databases with minimum code.

**Figure 3**
Tanglegram depicting hierarchical clustering performed on the Canberra distance matrices derived from the domain counts matrices of both Mash and RPS-tBLASTn pipelines. Both dendrograms are colored by their cluster id with k = 10. Base R function hclust was used to generate the clustering [18]. Correlation between both matrices was calculated with the Mantel test implemented in the ade4 R package [19]. The entanglement value and plot were generated with the Entanglement and Tanglegram functions implemented in the dendextend package [21]. Robinson–Foulds distance was calculated using the RF.dist function implemented in the Phangorn package [20].
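
The Canberra distance used for the clustering in Figure 3 can be illustrated with a minimal pure-Python sketch. This is not the authors' R workflow: the `counts` matrix below is made-up example data, and the treatment of terms where both values are zero (counted here as contributing nothing) is an assumption, since implementations differ on that case.

```python
from typing import List, Sequence


def canberra(x: Sequence[float], y: Sequence[float]) -> float:
    """Canberra distance: sum over i of |x_i - y_i| / (|x_i| + |y_i|)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # assumption: a 0/0 term contributes nothing to the sum
            total += abs(a - b) / denom
    return total


def distance_matrix(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric pairwise Canberra distance matrix for the given rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = canberra(rows[i], rows[j])
    return dist


# Hypothetical domain counts: one row per sample, one column per domain.
counts = [
    [4, 0, 1],
    [2, 0, 3],
    [0, 5, 0],
]
matrix = distance_matrix(counts)
```

A matrix of this kind is the input that a hierarchical clustering routine (such as `hclust` in the caption) would take; each term is bounded by 1, so samples that share no domains sit at the maximum distance, equal to the number of domains in which either count is non-zero.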

**Figure 4**
(Left) HIV-1 reference genome graphs generated with SWIft Genomes in a Graph (SWIGG) with annotated k-mers/nodes. Number of input sequences (n) = 167. Node color corresponds to taxonomic distribution of k-mer. Size of nodes is proportional to occurrence of taxonomic category. (Right) HIV-1 subtypes A–J (n = 39), k-mer size = 41, threshold ≥ 2. Note that both example graphs are circular, which may represent the fact that common nodes occur within long terminal repeats (LTRs). Most of the HIV references used in this work were modeled after the proviral sequence, which includes 5′ and 3′ LTRs.
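
The idea of a graph built from k-mers shared across genomes, as in Figure 4, can be sketched as follows. This is an illustration only and not SWIGG's actual implementation: the function names, the toy sequences, and the reading of "threshold" as the minimum number of input sequences a k-mer must occur in are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_kmer_graph(
    seqs: Dict[str, str], k: int, threshold: int
) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Nodes are k-mers found in at least `threshold` sequences (assumed
    meaning of the threshold); edges join retained k-mers that follow one
    another within a sequence once unshared k-mers are skipped."""
    occurrence: Counter = Counter()
    for seq in seqs.values():
        occurrence.update(set(kmers(seq, k)))  # count each sequence once
    nodes = {km for km, n in occurrence.items() if n >= threshold}

    edges: Set[Tuple[str, str]] = set()
    for seq in seqs.values():
        kept = [km for km in kmers(seq, k) if km in nodes]
        edges.update(zip(kept, kept[1:]))
    return nodes, edges


# Toy sequences (hypothetical); the figure's right panel uses k = 41.
toy = {"a": "ACGTAC", "b": "ACGTTC", "c": "GGGTAC"}
nodes, edges = shared_kmer_graph(toy, k=3, threshold=2)
```

With real proviral references, a k-mer inside the 5′ LTR also occurs in the 3′ LTR, so a path through the retained nodes returns to a node it has already visited, which is consistent with the circular appearance noted in the caption.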

**Figure 5**
FIVE index schema. Each table (box) represents the output of one of the annotation efforts contributing to FIVE. For each table, the title is shown in white on a blue rectangle (*accession2species*, *combined_known_interactions*, *cdd_data*, *spacer_db*, *domains_viral_cds_tblastn*, and *hiv_a_jrefs_k41_t2*), immediately followed by the field names or categories for that table. Each line corresponds to a field: the first column gives the abbreviated name of the field's content and the second column gives its format (int for integers, char for strings of characters, float for decimals). Primary keys for each table are shown in bold. Each table can be accessed independently, and primary keys from one table can be linked to fields in another table (links shown in grey).

References

1. Mardis E.R. A decade’s perspective on DNA sequencing technology. Nature. 2011;470:198–203. doi: 10.1038/nature09796.
2. Kodama Y., Shumway M., Leinonen R. The sequence read archive: Explosive growth of sequencing data. Nucleic Acids Res. 2012;40:D54–D56. doi: 10.1093/nar/gkr854.
3. SRA Database Growth. [(accessed on 3 December 2020)]; Available online: https://www.ncbi.nlm.nih.gov/sra/docs/sragrowth/
4. Connor R., Brister R., Buchmann J., Deboutte W., Edwards R., Martí-Carreras J., Tisza M., Zalunin V., Andrade-Martínez J., Cantu A., et al. NCBI’s Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements. Genes (Basel). 2019;10:714. doi: 10.3390/genes10090714.
5. STRIDES Initiative. [(accessed on 3 December 2020)]; Available online: https://datascience.nih.gov/strides.
