2025 Jan 22:2025:baae132.
doi: 10.1093/database/baae132.

Standardized pipelines support and facilitate integration of diverse datasets at the Rat Genome Database

Jennifer R Smith et al. Database (Oxford).

Abstract

The Rat Genome Database (RGD) is a multispecies knowledgebase which integrates genetic, multiomic, phenotypic, and disease data across 10 mammalian species. To support cross-species, multiomics studies and to enhance and expand on data manually extracted from the biomedical literature by the RGD team of expert curators, RGD imports and integrates data from multiple sources. These include major databases and a substantial number of domain-specific resources, as well as direct submissions by individual researchers. The incorporation of these diverse datatypes is handled by a growing list of automated import, export, data processing, and quality control pipelines. This article outlines the development over time of a standardized infrastructure for automated RGD pipelines with a summary of key design decisions and a focus on lessons learned.

Conflict of interest statement

None declared.

Figures

Figure 1.
The drop-and-reload paradigm. Dropping and reloading data introduces multiple potential points of failure in a sequence of data loading pipelines. At each step, the loading pipeline pulls data in from the source (INCOMING DATA 1, 2, 3, …, n). The corresponding data in the database, represented by boxes with striped backgrounds, are removed in preparation for the newly imported data to be loaded (boxes with solid backgrounds). A pipeline failure, denoted by a red X and the absence of the corresponding box within the database icon, prevents the sequence from continuing until the problem has been fixed and that pipeline has run successfully. If there are additional issues at any subsequent steps—either pre-existing or potentially introduced by a previous repair—the load fails again until those are also fixed. In this way, drop-and-reload introduces the risk that substantial sets of data could be dropped and not reloaded, compromising data integrity, and that data releases to the public might need to be delayed.
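The failure mode described in this caption can be sketched in a few lines of Python. This is an illustration only, not RGD's implementation: the in-memory dictionary standing in for the database, the table names, and the functions `drop_and_reload` and `run_sequence` are all hypothetical.

```python
def drop_and_reload(database, table, fetch):
    """Drop a table's rows, then reload them from the source (hypothetical helper)."""
    database[table] = {}        # existing data is removed before the new data arrives
    database[table] = fetch()   # if fetch() raises, the table is left empty


def run_sequence(database, loaders):
    """Run the loaders in order and stop at the first failure.

    `loaders` is a list of (table_name, fetch_function) pairs. Returns the
    names of the tables that were reloaded successfully.
    """
    completed = []
    for table, fetch in loaders:
        try:
            drop_and_reload(database, table, fetch)
        except Exception:
            # The failed table has already been dropped, and every later
            # loader in the sequence is blocked until this one is repaired.
            break
        completed.append(table)
    return completed
```

In this sketch a failure in the second of three loaders leaves the first table refreshed, the second table empty, and the third table holding stale data, which is the situation the red X in the figure represents.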
Figure 2.
Diagram illustrating how only updating a subset of the data in a database when that subset changes at the source can prevent a failure in one pipeline from interfering with other data loading pipelines that run against the same datastore.
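A minimal sketch of the incremental approach follows, under the same caveats: the abstract and captions do not give implementation details, so the dictionary-based store and the functions `incremental_update` and `run_pipelines` are assumptions made for illustration. The key point is that incoming data are fetched in full before any stored record is touched, and only records that differ from the source are changed.

```python
def incremental_update(existing, incoming):
    """Bring `existing` in line with `incoming`, changing only rows that differ."""
    to_insert = {k: v for k, v in incoming.items() if k not in existing}
    to_update = {k: v for k, v in incoming.items()
                 if k in existing and existing[k] != v}
    to_delete = [k for k in existing if k not in incoming]

    existing.update(to_insert)
    existing.update(to_update)
    for key in to_delete:
        del existing[key]

    return {"inserted": len(to_insert),
            "updated": len(to_update),
            "deleted": len(to_delete)}


def run_pipelines(database, loaders):
    """Run every loader independently and return the names of those that failed.

    `loaders` is a list of (table_name, fetch_function) pairs.
    """
    failed = []
    for table, fetch in loaders:
        try:
            incoming = fetch()   # fetch completely before modifying stored data
        except Exception:
            failed.append(table) # stored data for this table stay as they were
            continue             # later pipelines still run
        incremental_update(database[table], incoming)
    return failed
```

Because a failed fetch leaves the stored rows intact and does not stop the loop, one broken source no longer blocks the other pipelines that run against the same datastore. The per-run counts returned by `incremental_update` could also feed a quality control check, for example flagging a run that would delete an unexpectedly large share of records.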
Figure 3.
Flowchart illustrating the steps of the PharmGKB pipeline which associates PharmGKB identifiers with human gene records in the RGD database.
Figure 4.
The list of folders and files that constitute the PharmGKB pipeline with short descriptions of the function of each file in the list, presented as an image.
