Front Big Data. 2022 May 11:5:883341.

doi: 10.3389/fdata.2022.883341. eCollection 2022.

FAIR Digital Twins for Data-Intensive Research

Erik Schultes^{1,2}, Marco Roos^{1,3}, Luiz Olavo Bonino da Silva Santos⁴, Giancarlo Guizzardi^{4,5}, Jildau Bouwman^{1,6}, Thomas Hankemeier⁷, Arie Baak⁸, Barend Mons^{1,2,3,7}

Affiliations

¹ Leiden Institute for FAIR and Equitable Science, Leiden, Netherlands.
² GO FAIR Foundation, Leiden, Netherlands.
³ Human Genetics Department, Leiden University Medical Center, Leiden, Netherlands.
⁴ Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente, Enschede, Netherlands.
⁵ Faculty of Computer Science, Free University of Bozen-Bolzano, Bolzano, Italy.
⁶ Netherlands Organisation for Applied Scientific Research, Zeist, Netherlands.
⁷ Leiden Academic Centre for Drug Research, Leiden University, Leiden, Netherlands.
⁸ Euretos, Utrecht, Netherlands.

PMID: 35647536
PMCID: PMC9130601
DOI: 10.3389/fdata.2022.883341

Erik Schultes et al. Front Big Data. 2022.

Abstract

Although all the technical components supporting fully orchestrated Digital Twins (DT) currently exist, what remains missing is a conceptual clarification and analysis of a more generalized concept of a DT that is made FAIR, that is, universally machine actionable. This methodological overview is a first step toward this clarification. We present a review of previously developed semantic artifacts and how they may be used to compose a higher-order data model referred to here as a FAIR Digital Twin (FDT). We propose an architectural design to compose, store and reuse FDTs supporting data intensive research, with emphasis on privacy by design and their use in GDPR compliant open science.

Keywords: FAIR Digital Object; FAIR Digital Twin; FAIR guiding principles; Knowlet; augmented reasoning; data stewardship; machine learning; nanopublications.

Conflict of interest statement

ES and BM were employed by GO FAIR Foundation. AB was employed by Euretos. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
People think in “concepts,” that is, in “unique units of thought” and how they are meaningfully connected (Mons and Velterop, 2009). Such a unit of thought can refer to a physical object in reality, to an “intellectual concept” or abstraction, or to a “mental construct” such as “cancer” (also a “type”) as opposed to a physical instance of a tumor, or to an abstract concept without a physical twin. Humans use a multitude of symbols, words, tokens, and identifiers to “refer to” the concept they have in mind. This ambiguity in human language (and pointedly also in classical scientific narrative) is a source of confusion because of synonyms, homonyms, (mis)translation into other languages, and semantic as well as conceptual drift. All of these will be rather superficially addressed in the FDT context later.

**Figure 2**
People have multiple “senses” to disambiguate the exact meaning of the object or the concept they are “talking about,” and a unique and highly developed skill set to communicate, even across national languages and jargon, with a reasonable outcome. Still, it could be argued that many misunderstandings between cultures, and even between scientists, can be traced back to either “false agreements” (we think we are talking about exactly the same concept, but we are not) or “false disagreements” (we think we are talking about significantly different things, while in fact we are not, and the consensus is much bigger than we experience). In science, the sources of ambiguity should ideally be kept to the absolute, unavoidable minimum. The good news is that moving to machine readable communication of the essence of our scientific findings will also help human communication by reducing both false agreements and false disagreements (Mons et al., 2011).

**Figure 3**
Machines can only deal, internally, with “digital objects.” These are nothing other than a “bitstream,” perhaps comparable to the “series of spikes” in the sensory system of people. We do not even know whether the same object triggers the same “bitstream” in different people, even though they may still call the same emission of light “red.” Machines, however, need very precise instructions about the defined meaning referred to by a particular bitstream, and therefore each “concept” for a machine (a disease, a drug, a person, etc.) should have what we call a Globally Unique, Persistent and Resolvable Identifier (GUPRI). In other words, the bitstream that “refers” to a concept (be it another digital object, an object in the physical world, or a mental construct) should resolve (universally) to one, and only one, intended defined meaning at any time. Once we have agreed on such a GUPRI for each concept we deal with in science, we can communicate in a much more precise way, also as humans.

**Figure 4**
Machines can deal with precise concepts if they are properly defined (the definition preferably also in machine interpretable format). For machines, concepts and their relations should be precisely defined, and it is very important that the association (the “predicate” in semantic triple terms) is also a well-defined concept, with its own GUPRI. Once that is done properly, machines have a magnificent ability to discover and interpret complex patterns in massive amounts of data and information. With simple mapping tables, they are also able to output this information in Graphical User Interfaces and sounds in a human understandable and non-ambiguous way (and in multiple languages).

**Figure 5**
An example of a nanopublication file in RDF format. Please note that the different colored sections, which appear here as one file, are in fact going to be stored in separate, but linked containers as indicated in the current FDO schema. This process is currently ongoing, and the outcome of the process will be reported in a technical paper to be published soon. Online versions of these impressions can be found here: a nanopublication schema and a nanopublication example.

**Figure 6**
An artist's impression of the anatomy of a nanopublication, building on the earlier definitions of Mons and Velterop (2009) and Groth et al. (2010), as a FAIR Digital Object (FDO)⁹: the subject, the predicate and the object are all referred to with a GUPRI (usually in RDF context). The “container” of the triple is typed as an RDF triple or, when more elaborate, as an RDF graph, and has itself a GUPRI. Its metadata (mostly multiple containers, containing multiple assertions/triples about the nanopublication) are FDOs in and of themselves, annotated with the proper type and their own GUPRI. Following the FAIR principles, each metadata container contains the explicit reference to the FDO it points to via that FDO's GUPRI. We have argued for a decade now that all precise assertions (i.e., “claims”) in science should preferably have the form of machine actionable nanopublications.
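
The anatomy described in this caption can be sketched as a small data structure. This is only an illustration, not the FDO schema itself: the class names, the field names, and every `https://example.org/` identifier are assumptions standing in for real GUPRIs and types.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Triple:
    """One assertion; each position holds a GUPRI (placeholder URIs here)."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class MetadataContainer:
    """A metadata container that is itself an FDO with its own GUPRI and type."""
    gupri: str
    fdo_type: str            # hypothetical type label, e.g. "provenance"
    is_metadata_of: str      # GUPRI of the FDO this container describes
    statements: Tuple[Triple, ...]


@dataclass(frozen=True)
class Nanopublication:
    """The container of the assertion, with its own GUPRI and linked metadata."""
    gupri: str
    assertion: Triple
    metadata: Tuple[MetadataContainer, ...]
    fdo_type: str = "nanopublication"

    def metadata_is_linked(self) -> bool:
        # Every metadata container must point back to this nanopublication.
        return all(m.is_metadata_of == self.gupri for m in self.metadata)


EX = "https://example.org/"  # invented namespace, not a real resolver
nanopub = Nanopublication(
    gupri=EX + "np/1",
    assertion=Triple(EX + "gene/G", EX + "rel/associated_with", EX + "disease/D"),
    metadata=(
        MetadataContainer(
            EX + "np/1/provenance", "provenance", EX + "np/1",
            (Triple(EX + "np/1", EX + "rel/derived_from", EX + "source/S"),),
        ),
        MetadataContainer(
            EX + "np/1/pubinfo", "publication-info", EX + "np/1",
            (Triple(EX + "np/1", EX + "rel/created_by", EX + "person/P"),),
        ),
    ),
)
```

Calling `nanopub.metadata_is_linked()` checks the one property the caption states explicitly: each metadata container carries a reference to the GUPRI of the FDO it describes.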

**Figure 7**
An artist's impression of a cardinal assertion with multiple provenance and publication metadata files associated.

**Figure 8**
Summary of the consolidation from multiple nanopublications with identical assertions to one cardinal assertion with multiple provenance files, allowing detailed study of the different sources for this assertion. **(A)** depicts a typical nanopublication having an assertion (green-yellow-red) and provenance (purple) while **(B)** is a different nanopublication having the same assertion (green-yellow-red) but different provenance (red). **(C)** depicts how the single cardinal assertion (green-yellow-red) links to many independent nanopublications (multiple provenance). In computer reasoning, the use of the actual provenance files will be limited, although they can be used as a source for a numerical “evidence level” representing a subjective level of trustworthiness in a given context and for a given purpose. Metadata files that state contesting opinions and render the assertion less likely or even controversial can be used as well.
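
A minimal sketch of this consolidation step, assuming nanopublications are plain dictionaries and that the assertion triple itself serves as the grouping key. The identifiers, the `kind` field, and the weights used for the “evidence level” are invented for illustration; the caption only says that such a level is subjective and context dependent, so the weighted sum below is one possible choice, not a prescribed method.

```python
from collections import defaultdict

# Hypothetical nanopublications: np:A and np:B carry the same assertion
# but different provenance; np:C carries a different assertion.
nanopubs = [
    {"id": "np:A",
     "assertion": ("ex:gene_G", "ex:associated_with", "ex:disease_D"),
     "provenance": {"source": "ex:paper_1", "kind": "curated"}},
    {"id": "np:B",
     "assertion": ("ex:gene_G", "ex:associated_with", "ex:disease_D"),
     "provenance": {"source": "ex:paper_2", "kind": "text-mined"}},
    {"id": "np:C",
     "assertion": ("ex:gene_G", "ex:located_on", "ex:chromosome_X"),
     "provenance": {"source": "ex:database_1", "kind": "curated"}},
]


def consolidate(nanopubs):
    """Map each distinct assertion to the list of its provenance records."""
    cardinal = defaultdict(list)
    for nanopub in nanopubs:
        cardinal[nanopub["assertion"]].append(nanopub["provenance"])
    return dict(cardinal)


# Assumed, purely subjective weights per kind of source.
WEIGHTS = {"curated": 1.0, "text-mined": 0.25}


def evidence_level(provenance_records, weights=WEIGHTS):
    """Sum the weight of every source supporting one cardinal assertion."""
    return sum(weights.get(record["kind"], 0.0) for record in provenance_records)


cardinal_assertions = consolidate(nanopubs)
shared = ("ex:gene_G", "ex:associated_with", "ex:disease_D")
shared_evidence = evidence_level(cardinal_assertions[shared])
```

Here three nanopublications collapse into two cardinal assertions, and the shared assertion keeps both provenance records so that its sources remain available for inspection.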

**Figure 9**
A Knowlet file is created from cardinal assertions with the same subject (GUPRI) [adapted from Mons (2019)]. A typical Knowlet will contain hundreds, thousands or even millions of FDOs (GUPRIs for all concepts, Cardinal Assertions, and custom references to external provenance files). By filtering the cardinal assertions in a Knowlet based on any type of recorded feature (such as predicates, object-semantic types, for instance “drug,” or time date stamps) we can create a “qua” (Masolo et al., ; Guizzardi, 2006) of any Knowlet for customized use.
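
The two operations in this caption, collecting cardinal assertions by subject and filtering them into a “qua,” can be sketched as follows. The `CardinalAssertion` fields, the function names, and all `ex:` identifiers are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List


@dataclass(frozen=True)
class CardinalAssertion:
    subject: str      # GUPRI of the subject (placeholder strings here)
    predicate: str
    obj: str
    obj_type: str     # semantic type of the object, e.g. "drug"
    added_on: date    # time date stamp of the assertion


def build_knowlet(subject: str,
                  assertions: List[CardinalAssertion]) -> List[CardinalAssertion]:
    """A Knowlet: all cardinal assertions that share the given subject."""
    return [a for a in assertions if a.subject == subject]


def qua(knowlet: List[CardinalAssertion],
        keep: Callable[[CardinalAssertion], bool]) -> List[CardinalAssertion]:
    """A qua: the Knowlet restricted to assertions that pass the filter."""
    return [a for a in knowlet if keep(a)]


assertions = [
    CardinalAssertion("ex:disease_D", "ex:treated_by", "ex:drug_1", "drug", date(2015, 5, 1)),
    CardinalAssertion("ex:disease_D", "ex:associated_with", "ex:gene_G", "gene", date(2018, 2, 1)),
    CardinalAssertion("ex:disease_D", "ex:has_symptom", "ex:symptom_S", "symptom", date(2020, 9, 1)),
    CardinalAssertion("ex:gene_G", "ex:encodes", "ex:protein_P", "protein", date(2012, 1, 1)),
]

disease_knowlet = build_knowlet("ex:disease_D", assertions)
drug_qua = qua(disease_knowlet, lambda a: a.obj_type == "drug")
early_qua = qua(disease_knowlet, lambda a: a.added_on < date(2019, 1, 1))
```

Passing the filter as a function keeps the sketch open to any recorded feature, whether predicate, object type, or time stamp, which matches the caption's “any type of recorded feature.”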

**Figure 10**
The basic anatomy of a FAIR Digital Object (FDO) (see text footnote⁹). The object (resource) in this case is the actual assertion (single triple or small graph); the FDO Identifier record is associated with a GUPRI that resolves uniquely to this nanopublication. The FDO type is “nanopublication,” which would inform machines about the generic technical actions possible with this type. The metadata (which are usually multiple files) as shown in Figure 5 are stored in a separate container and linked with the predicate <fdof:isMetadataOf> to the FDO identifier record (GUPRI).

**Figure 11**
**(A)** The “conceptual similarity” of two Knowlets (FDTs) can be simply calculated based on existing vector matching approaches. This allows basic hypothesis generation (association between two Knowlets of concepts that have so far never been directly associated). **(B)** Now that machines are able, coming closer to human perception, to see the subject of the Knowlet in a much broader conceptual context (associating all objects and predicates in the Knowlet with the subject), computers can deal much better with “near sameness” and subtle semantic differences, without being explicitly instructed via fixed predicates. In static knowledge graphs or ontologies, we cannot simply define a predicate like “nearly identical as,” because this is an intrinsically undefined and thus elusive concept to machines. However, if in picture **(B)** these three Knowlets represent, for instance, the homologous gene in *H. sapiens*, *M. musculus*, and *R. norvegicus* (man, mouse, rat), the vast majority of the millions of concepts (FDOs) in the three Knowlets may be identical. That means that the machine “knows” that these three concepts are distinct (because they have different central subject GUPRIs as well as overall container GUPRIs) but are “nearly similar,” and it “knows” the relative extent to which they are similar. Later we will see that their quas (filtered from all concepts related to species and chromosomal location) will be identical. **(C)** Importantly, although single nanopublications (snapshots of an assertion made at a given time) and the content of cardinal assertions should be immutable, and protected by, for instance, Trusty URIs (Kuhn and Dumontier, 2014), the Knowlet representing a concept is principally mutable and drifting. For instance, if the Knowlet refers to a gene, any new variant found in an instance of the gene detected by sequencing research will add a new assertion to the overall Knowlet of that gene as a mental construct. This will effectively create a new version of the Knowlet (FDO) with a new GUPRI, which in most cases will be 99.999% identical to its immediate predecessor and can contain the assertion “[new GUPRI] previously known as [predecessor GUPRI].” The result is a blockchain-type sequence that enables backward recovery of earlier versions of the Knowlet, supporting versioning (for instance of a workflow) or semantic drift detection (of a concept over time or by geographical region/culture). **(D)** Finally, each Knowlet can be “filtered” on any feature that is supported by the internal or externally associated content, for instance on time date stamps of individual cardinal assertions in the Knowlet, on predicate type, or on the semantic type of the objects in the cardinal assertions. Coming back to **(B)** (near sameness): assuming that a very conserved gene/protein like actin would be identical in man, mouse, and rat, the three Knowlets of these three distinct concepts would differ on at least two cardinal assertions, those determining the species and the chromosomal location of the gene. The Knowlets will be nearly similar even without filtering, but it will be a relatively straightforward machine-instruction to “ignore species and chromosomal location,” and these Knowlet quas will then be 100% identical and treated in any graph-reasoning as “one and the same concept” until there is a need to separate them again. Finally, the qua approach might also reduce or even eliminate the need for new GUPRIs for each version of a developing FDT. In fact, older versions can be simply recovered by filtering on all cardinal assertions that were added before a given time.
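
Panels (A), (B) and (D) can be illustrated with a toy calculation. The caption refers only to “existing vector matching approaches”; the sketch below swaps in cosine similarity over binary indicator vectors as one simple instance. The Knowlet contents, predicate labels, chromosome placeholders, and dates are all invented.

```python
from datetime import date
from math import sqrt

# Assertions shared by the hypothetical homologous-gene Knowlets:
# (predicate, object, date added). All labels are made up.
SHARED = [
    ("has_function", "cytoskeleton_structure", date(2001, 1, 1)),
    ("binds", "myosin", date(2003, 6, 1)),
    ("participates_in", "muscle_contraction", date(2010, 3, 15)),
]


def knowlet(species, location):
    """Shared assertions plus the two that distinguish each species."""
    return SHARED + [
        ("found_in_species", species, date(2001, 1, 1)),
        ("located_on", location, date(2001, 1, 1)),
    ]


human = knowlet("H. sapiens", "chromosome_A")
mouse = knowlet("M. musculus", "chromosome_B")


def features(k):
    """The set of (predicate, object) pairs, ignoring time stamps."""
    return {(predicate, obj) for predicate, obj, _ in k}


def similarity(a, b):
    """Cosine similarity of the binary indicator vectors of two Knowlets."""
    fa, fb = features(a), features(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / sqrt(len(fa) * len(fb))


def qua(k, ignore_predicates):
    """Drop all assertions whose predicate is in ignore_predicates."""
    return [t for t in k if t[0] not in ignore_predicates]


def as_of(k, cutoff):
    """Recover an earlier version: assertions added before the cutoff."""
    return [t for t in k if t[2] < cutoff]


IGNORE = {"found_in_species", "located_on"}
raw_similarity = similarity(human, mouse)                             # 3 / 5 = 0.6
qua_similarity = similarity(qua(human, IGNORE), qua(mouse, IGNORE))   # 3 / 3 = 1.0
human_2005 = as_of(human, date(2005, 1, 1))                           # 4 assertions
```

With species and chromosomal location ignored, the two quas become identical, which is the “100% identical” case of panel (D). The `as_of` filter recovers an earlier state of a Knowlet without minting a new identifier for each version.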

**Figure 12**
The Knowlets (strongly filtered quas) of three diseases and their gene overlap (Euretos interface). **(A)** The Knowlets are filtered on the semantic type “gene” (object) and for curated annotations only (predicate). No overlap in genes is detected. **(B)** The Knowlet is expanded to the broader concept of Alzheimer's disease. **(C)** The filter of **(A)** is the same, but now all literature co-occurrences (not yet annotated and some potentially spurious) are included.

**Figure 13**
The visualization of the interconnections between the 68 genes associated with three diseases (filter setting as in Figure 12C) (Euretos interface). Each line in this graph represents a cardinal assertion (as in Figure 8), and all provenance (type of relation = predicate), time date stamp and (mostly multiple) sources of that cardinal assertion can be explored, effectively showing all nanopublications supporting the assertion (as in Figure 6).

References

1. Collins S., Genova F., Harrower N., Hodson S., Jones S., Laaksonen L., et al. (2018). Turning FAIR into Reality. Final Report and Action Plan from the European Commission Expert Group on FAIR Data (European Commission).
2. Gibson J. C. J., van Dam E. A., Schultes M., Roos B. M. (2012). “Towards computational evaluation of evidence for scientific assertions with nanopublications and cardinal assertions,” in EUR Workshop Proceedings. Available online at: https://www.mendeley.com/research-papers/towards-computational-evaluatio...
3. Grieves M. (2019). “Virtually intelligent product systems: digital and physical twins,” in Complex Systems Engineering: Theory and Practice, eds S. Flumerfelt, K. G. Schwartz, D. Marries, S. Briceno, and T. C. Lieuwen (Portland, OR: American Institute of Aeronautics and Astronautics), 175–200. doi: 10.2514/5.9781624105654.0175.0200
4. Groth P., Gibson A., Velterop J. (2010). The anatomy of a nano-publication. Inform Serv. Use 30, 1–2. doi: 10.3233/ISU-2010-0613
5. Guizzardi G. (2006). “Agent roles, qua individuals and the counting problem,” in Invited Chapter in Software Engineering of Multi-Agent Systems, Vol. 4, eds P. Giorgini, A. Garcia, C. Lucena, R. Choren (Berlin; Heidelberg: Springer-Verlag), 143–160. Available online at: https://www.researchgate.net/publication/225486025_Agent_Roles_Qua_Indiv...
