Gigascience. 2024 Jan 2;13:giae058.
doi: 10.1093/gigascience/giae058.

An interconnected data infrastructure to support large-scale rare disease research

Lennart F Johansson et al. Gigascience. 2024.

Abstract

The Solve-RD project brings together clinicians, scientists, and patient representatives from 51 institutes spanning 15 countries to collaborate on genetically diagnosing ("solving") rare diseases (RDs). The project aims to significantly increase the diagnostic success rate by co-analyzing data from thousands of RD cases, including phenotypes, pedigrees, exome/genome sequencing, and multiomics data. Here we report on the data infrastructure devised and created to support this co-analysis. This infrastructure enables users to store, find, connect, and analyze data and metadata in a collaborative manner. Pseudonymized phenotypic and raw experimental data are submitted to the RD-Connect Genome-Phenome Analysis Platform and processed through standardized pipelines. Resulting files and novel produced omics data are sent to the European Genome-Phenome Archive, which adds unique file identifiers and provides long-term storage and controlled access services. MOLGENIS "RD3" and Café Variome "Discovery Nexus" connect data and metadata and offer discovery services, and secure cloud-based "Sandboxes" support multiparty data analysis. This successfully deployed and useful infrastructure design provides a blueprint for other projects that need to analyze large amounts of heterogeneous data.

Keywords: bioinformatics; computational biology; fair data; genetics; infrastructure; rare disease.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1:
Rare disease analysis infrastructure overview. deep-ES: deep sequencing ES; EGA: European Genome-Phenome Archive; ERN: European Reference Network; ES: exome sequencing; GPAP: Genome-Phenome Analysis Platform; GS: genome sequencing; LR-GS: long-read genome sequencing; LR-RNAseq: long-read RNA sequencing; SR-GS: short-read genome sequencing; SR-RNAseq: short-read RNA sequencing; UI: user interface. The Solve-RD dataset is also discoverable through the participation of the RD-Connect GPAP in Matchmaker Exchange and the Beacon Network.
Figure 2:
Sandbox folder structure. Data are organized by the data analysis working groups (DATF working group [WG]) in either folders per European Reference Network (ERN) or a common folder (for data intended for all ERNs). Additionally, large files that should be kept but not shared are stored in a “Sandbox only” folder. All data to be shared with the ERNs are linked to an sftp folder with a subfolder per ERN, accessible via the SFTP protocol. Thin arrows indicate links between specific subfolders. These folders are further synchronized to 2 folders, DATF and DITF, each holding the same information (indicated by the thick arrow). The DATF folder has the same structure as the initial sftp folder (a folder for each DATF WG with subfolders per ERN). The DITF folder has the converse structure (a folder for each DITF ERN with subfolders per WG). This structure makes it easy for both DATF and DITF to browse the data (e.g., all CNV data or all data from ERN-ITHACA).
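The converse DATF/DITF layouts described in this caption can be illustrated with a minimal Python sketch. It assumes paths of the form `DATF/<WG>/<ERN>/...`; the function name `datf_to_ditf` and the file name `calls.tsv` are hypothetical and not taken from the Solve-RD Sandbox itself, and the common and “Sandbox only” folders are not modelled.

```python
from pathlib import PurePosixPath


def datf_to_ditf(path: str) -> str:
    """Map DATF/<WG>/<ERN>/<rest> to the converse DITF/<ERN>/<WG>/<rest>.

    Hypothetical helper: only the per-ERN folders are handled here.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or parts[0] != "DATF":
        raise ValueError(f"expected DATF/<WG>/<ERN>/..., got {path!r}")
    _, wg, ern, *rest = parts
    # Swap the working-group and ERN levels; keep everything below unchanged.
    return str(PurePosixPath("DITF", ern, wg, *rest))


if __name__ == "__main__":
    print(datf_to_ditf("DATF/CNV/ERN-ITHACA/calls.tsv"))
```

Because both trees hold the same files, a working group can browse by analysis type while an ERN browses by network, without either side needing to reorganize the other's view.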
Figure 3:
Data and metadata relations within Solve-RD. Arrows indicate the “derived from” direction (e.g., Sample DNA00001 is derived from Subject P00001). We distinguish 4 main data/metadata types: subject, sample, experiments, and files, with each derived from the former. This figure is actually a simplification as data are further organized in data releases we call “freezes” and can be used in different combinations as “analyses.”
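The “derived from” chain in this caption can be sketched as a small Python data structure. This is an illustration only, not the RD3 schema: the `Record` class, the `lineage` function, and the identifiers `EXP00001` and `FILE00001` are assumptions; only `P00001` and `DNA00001` come from the caption. Freezes and analyses are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    identifier: str
    kind: str  # "subject", "sample", "experiment", or "file"
    derived_from: Optional[str] = None  # identifier of the parent record


def lineage(records: dict[str, Record], identifier: str) -> list[str]:
    """Follow the "derived from" arrows from a record back to its subject."""
    chain: list[str] = []
    current: Optional[str] = identifier
    while current is not None:
        record = records[current]
        chain.append(record.identifier)
        current = record.derived_from
    return chain


# EXP00001 and FILE00001 are hypothetical identifiers used for illustration.
records = {
    r.identifier: r
    for r in [
        Record("P00001", "subject"),
        Record("DNA00001", "sample", derived_from="P00001"),
        Record("EXP00001", "experiment", derived_from="DNA00001"),
        Record("FILE00001", "file", derived_from="EXP00001"),
    ]
}

if __name__ == "__main__":
    print(lineage(records, "FILE00001"))
```

Storing only the parent identifier on each record keeps the four types loosely coupled, so a file can always be traced to its subject by walking the chain.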
Figure 4:
Solve-RD RD3 LabInfo screen showing a subset of the Freeze1 experiment data. On the left, entries are filtered on patch “Original data” and columns are filtered on interest. In the current view, the experimentID is connected to the sample on which the experiment was performed. In addition, information on the experiment is shown. For these samples, genomic data were the input for exome sequencing experiments in which various enrichment kits were used. For most of the samples, statistics on the average target coverage (MeanCov) and the number of bases covered by at least 20 sequencing reads (C20) were available. If a subject was retracted from the project, all metadata except identifiers were removed from the database and the experiment was labeled as retracted.
Figure 5:
(A) Discovery Nexus query interface. This interface supports querying by any combination of various demographic and inheritance (Subject Filters), phenotypes (HPO Query Builder), diseases (ORDO Query Builders), or suspected variant filters (Variant Filter). In the HPO Query Builder, typing any part of an HPO phenotype term or code creates a visible list of relevant items to select from, whereupon they are transferred into the adjacent panel to form part of the query. Phenotype matching can specify matching on identical terms only (exact) or recover similar terms (based on a precomputed matrix of relationship scores and the position of the slider). The minimum number of matching terms can also be specified, creating an “OR” query, and settings above the minimum create a query that returns results that match at least the specified number of terms in any combination. HPO queries can also be instructed to interrogate phenotype data stored as ORDO terms. Matching of HPO to ORDO terms (in the ORDO Query Builder) is controlled by the HPO pairwise similarity slider, to define the number of HPO terms that should match an ORDO term as well as the ORDO match scale, defining the specificity of the HPO term(s) to the selected ORDO term (based on a precomputed matrix of their occurrence across all ORDO terms). Hence, when mapping ORDO to HPO terms, exact matching will traverse the mapping of these 2 term sets to find fewer but more specific HPO terms, while minimum matching will include more HPO terms, but these may match other ORDO terms as well. Variant data cannot be filtered at the specific base-change level (as this would raise privacy concerns) but are instead queryable by host gene, allele frequency, and mutation type using the Variant Query Builder. It is also possible to filter for variants based on affected biochemical pathways, given known relationships between genes and pathways (using the Reactome Knowledge base [60]). 
Finally, the ERN dataset to be queried must be explicitly stated, and the user must have permission to query the specified ERNs. (B) Discovery Nexus query results. After the query is submitted using the “Build query” button, the system returns a count of matching results in the selected resources. Clicking the number in the blue box brings up the summary pop-up window shown above, giving basic details of the matches (again subject to the user having been assigned permissions). The blue “Get Full Data for Selected Subjects” button opens a link to request access from the resources holding the required data (where this is available). Alternatively, where a direct link to request the data is not available, clicking the green button in the source details opens a summary page with contact details for the resource.
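The “minimum number of matching terms” setting described in panel (A) can be illustrated with a minimal Python sketch, assuming exact term matching only. It is not the Discovery Nexus implementation: the function `match_subjects`, the subject identifiers, and the subject-to-term assignments are all hypothetical, and similarity-based matching, ORDO mapping, and permissions are not modelled.

```python
def match_subjects(
    subjects: dict[str, set[str]],
    query_terms: set[str],
    min_matches: int = 1,
) -> list[str]:
    """Return IDs of subjects sharing at least `min_matches` query terms.

    min_matches=1 behaves like an "OR" query; min_matches=len(query_terms)
    requires every term to be present.
    """
    if not 1 <= min_matches <= len(query_terms):
        raise ValueError("min_matches must be between 1 and the number of query terms")
    return sorted(
        subject_id
        for subject_id, terms in subjects.items()
        if len(terms & query_terms) >= min_matches
    )


# Hypothetical subjects annotated with HPO term codes (illustrative data only).
subjects = {
    "P00001": {"HP:0001250", "HP:0001263"},
    "P00002": {"HP:0001250"},
    "P00003": {"HP:0000252"},
}
query = {"HP:0001250", "HP:0001263"}

if __name__ == "__main__":
    print(len(match_subjects(subjects, query, min_matches=1)))  # count for an "OR" query
    print(len(match_subjects(subjects, query, min_matches=2)))  # count when both terms are required
```

Reporting only the count of matches, as the results panel does, lets a user judge whether a resource is worth an access request without exposing subject-level data.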

Similar articles

  • The RD-Connect Registry & Biobank Finder: a tool for sharing aggregated data and metadata among rare disease researchers.
    Gainotti S, Torreri P, Wang CM, Reihs R, Mueller H, Heslop E, Roos M, Badowska DM, de Paulis F, Kodra Y, Carta C, Martìn EL, Miller VR, Filocamo M, Mora M, Thompson M, Rubinstein Y, Posada de la Paz M, Monaco L, Lochmüller H, Taruscio D. Gainotti S, et al. Eur J Hum Genet. 2018 May;26(5):631-643. doi: 10.1038/s41431-017-0085-z. Epub 2018 Feb 2. Eur J Hum Genet. 2018. PMID: 29396563 Free PMC article.
  • The RD-Connect Genome-Phenome Analysis Platform: Accelerating diagnosis, research, and gene discovery for rare diseases.
    Laurie S, Piscia D, Matalonga L, Corvó A, Fernández-Callejo M, Garcia-Linares C, Hernandez-Ferrer C, Luengo C, Martínez I, Papakonstantinou A, Picó-Amador D, Protasio J, Thompson R, Tonda R, Bayés M, Bullich G, Camps-Puchadas J, Paramonov I, Trotta JR, Alonso A, Attimonelli M, Béroud C, Bros-Facer V, Buske OJ, Cañada-Pallarés A, Fernández JM, Hansson MG, Horvath R, Jacobsen JOB, Kaliyaperumal R, Lair-Préterre S, Licata L, Lopes P, López-Martín E, Mascalzoni D, Monaco L, Pérez-Jurado LA, Posada de la Paz M, Rambla J, Rath A, Riess O, Robinson PN, Salgado D, Smedley D, Spalding D, 't Hoen PAC, Töpf A, Zaharieva I, Graessner H, Gut IG, Lochmüller H, Beltran S. Laurie S, et al. Hum Mutat. 2022 Jun;43(6):717-733. doi: 10.1002/humu.24353. Hum Mutat. 2022. PMID: 35178824 Free PMC article.
  • Remote visualization of large-scale genomic alignments for collaborative clinical research and diagnosis of rare diseases.
    Corvò A, Matalonga L, Spalding D, Senf A, Laurie S, Picó-Amador D, Fernandez-Callejo M, Paramonov I, Romero AF, Garcia-Rios E, Ciges JI, Mohan A, Thomas C, Silva Valencia AF, Halmagyi C, Freeberg MA, Töpf A, Horvath R, Saunders G, Gut I, Keane T, Piscia D, Beltran S. Corvò A, et al. Cell Genom. 2023 Jan 11;3(2):100246. doi: 10.1016/j.xgen.2022.100246. eCollection 2023 Feb 8. Cell Genom. 2023. PMID: 36819661 Free PMC article.
  • Harmonising phenomics information for a better interoperability in the rare disease field.
    Maiella S, Olry A, Hanauer M, Lanneau V, Lourghi H, Donadille B, Rodwell C, Köhler S, Seelow D, Jupp S, Parkinson H, Groza T, Brudno M, Robinson PN, Rath A. Maiella S, et al. Eur J Med Genet. 2018 Nov;61(11):706-714. doi: 10.1016/j.ejmg.2018.01.013. Epub 2018 Feb 7. Eur J Med Genet. 2018. PMID: 29425702 Review.
  • Development of Bioinformatics Infrastructure for Genomics Research.
    Mulder NJ, Adebiyi E, Adebiyi M, Adeyemi S, Ahmed A, Ahmed R, Akanle B, Alibi M, Armstrong DL, Aron S, Ashano E, Baichoo S, Benkahla A, Brown DK, Chimusa ER, Fadlelmola FM, Falola D, Fatumo S, Ghedira K, Ghouila A, Hazelhurst S, Isewon I, Jung S, Kassim SK, Kayondo JK, Mbiyavanga M, Meintjes A, Mohammed S, Mosaku A, Moussa A, Muhammd M, Mungloo-Dilmohamud Z, Nashiru O, Odia T, Okafor A, Oladipo O, Osamor V, Oyelade J, Sadki K, Salifu SP, Soyemi J, Panji S, Radouani F, Souiai O, Tastan Bishop Ö; H3ABioNet Consortium, as members of the H3Africa Consortium. Mulder NJ, et al. Glob Heart. 2017 Jun;12(2):91-98. doi: 10.1016/j.gheart.2017.01.005. Epub 2017 Mar 13. Glob Heart. 2017. PMID: 28302555 Free PMC article. Review.

Cited by

  • Genomic reanalysis of a pan-European rare-disease resource yields new diagnoses.
    Laurie S, Steyaert W, de Boer E, Polavarapu K, Schuermans N, Sommer AK, Demidov G, Ellwanger K, Paramonov I, Thomas C, Aretz S, Baets J, Benetti E, Bullich G, Chinnery PF, Clayton-Smith J, Cohen E, Danis D, de Sainte Agathe JM, Denommé-Pichon AS, Diaz-Manera J, Efthymiou S, Faivre L, Fernandez-Callejo M, Freeberg M, Garcia-Pelaez J, Guillot-Noel L, Haack TB, Hanna M, Hengel H, Horvath R, Houlden H, Jackson A, Johansson L, Johari M, Kamsteeg EJ, Kellner M, Kleefstra T, Lacombe D, Lochmüller H, López-Martín E, Macaya A, Marcé-Grau A, Maver A, Morsy H, Muntoni F, Musacchia F, Nelson I, Nigro V, Olimpio C, Oliveira C, Paulasová Schwabová J, Pauly MG, Peterlin B, Peters S, Pfundt R, Piluso G, Piscia D, Posada M, Reich S, Renieri A, Ryba L, Šablauskas K, Savarese M, Schöls L, Schütz L, Steinke-Lange V, Stevanin G, Straub V, Sturm M, Swertz MA, Tartaglia M, Te Paske IBAW, Thompson R, Torella A, Trainor C, Udd B, Van de Vondel L, van de Warrenburg B, van Reeuwijk J, Vandrovcova J, Vitobello A, Vos J, Vyhnálková E, Wijngaard R, Wilke C, William D, Xu J, Yaldiz B, Zalatnai L, Zurek B; Solve-RD DITF-GENTURIS; Solve-RD DITF-ITHACA; Solve-RD DITF-EURO-NMD; Solve-RD DITF-RND; Solve-RD consortiu… See abstract for full author list ➔ Laurie S, et al. Nat Med. 2025 Feb;31(2):478-489. doi: 10.1038/s41591-024-03420-w. Epub 2025 Jan 17. Nat Med. 2025. PMID: 39825153 Free PMC article.

References

    1. Zurek B, Ellwanger K, Vissers LELM, et al. Solve-RD: systematic pan-European data sharing and collaborative analysis to solve rare diseases. Eur J Hum Genet. 2021;29:1325–31. doi: 10.1038/s41431-021-00859-0.
    2. Laurie S, Piscia D, Matalonga L, et al. The RD-Connect Genome-Phenome Analysis Platform: accelerating diagnosis, research, and gene discovery for rare diseases. Hum Mutat. 2022;43(6):717–33. doi: 10.1002/humu.24353.
    3. Swertz MA, Dijkstra M, Adamusiak T, et al. The MOLGENIS toolkit: rapid prototyping of biosoftware at the push of a button. BMC Bioinf. 2010;11(Suppl. 12):S12. doi: 10.1186/1471-2105-11-S12-S12.
    4. van der Velde KJ, Imhann F, Charbon B, et al. MOLGENIS research: advanced bioinformatics data software for non-bioinformaticians. Bioinformatics. 2019;35(6):1076–78. doi: 10.1093/bioinformatics/bty742.
    5. Lancaster O, Beck T, Atlan D, et al. Cafe Variome: general-purpose software for making genotype–phenotype data discoverable in restricted or open access contexts. Hum Mutat. 2015;36(10):957–64. doi: 10.1002/humu.22841.