Sci Data. 2020 Sep 8;7(1):300.

doi: 10.1038/s41597-020-00638-4.

AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance

Sebastiaan P Huber^{#,1,2}, Spyros Zoupanos^{#,3,4}, Martin Uhrin^{3,4}, Leopold Talirz^{3,4,5}, Leonid Kahle^{3,4}, Rico Häuselmann^{3,4}, Dominik Gresch⁶, Tiziano Müller⁷, Aliaksandr V Yakutovich^{3,4,5}, Casper W Andersen^{3,4}, Francisco F Ramirez^{3,4}, Carl S Adorf^{3,4}, Fernando Gargiulo^{3,4}, Snehal Kumbhar^{3,4}, Elsa Passaro^{3,4}, Conrad Johnston^{3,4}, Andrius Merkys⁸, Andrea Cepellotti^{3,4}, Nicolas Mounet^{3,4}, Nicola Marzari^{3,4}, Boris Kozinsky^{9,10}, Giovanni Pizzi^{11,12}

Affiliations

¹ National Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland. mail@sphuber.net.
² Theory and Simulation of Materials (THEOS), Faculté des Sciences et Techniques de l'Ingénieur, École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland. mail@sphuber.net.
³ National Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland.
⁴ Theory and Simulation of Materials (THEOS), Faculté des Sciences et Techniques de l'Ingénieur, École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland.
⁵ Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne (EPFL), Rue de l'Industrie 17, Sion, CH-1951, Valais, Switzerland.
⁶ Microsoft Station Q, University of California, Santa Barbara, California, 93106-6105, USA.
⁷ Department of Chemistry, University of Zürich, Zürich, Switzerland.
⁸ Vilnius University Institute of Biotechnology, Saulėtekio al. 7, LT-10257, Vilnius, Lithuania.
⁹ John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, 02138, United States.
¹⁰ Robert Bosch LLC, Research and Technology Center North America, 255 Main St, Cambridge, Massachusetts, 02142, USA.
¹¹ National Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland. giovanni.pizzi@epfl.ch.
¹² Theory and Simulation of Materials (THEOS), Faculté des Sciences et Techniques de l'Ingénieur, École Polytechnique Fédérale de Lausanne, CH-1015, Lausanne, Switzerland. giovanni.pizzi@epfl.ch.

^# Contributed equally.

PMID: 32901044
PMCID: PMC7479590
DOI: 10.1038/s41597-020-00638-4

Sebastiaan P Huber et al. Sci Data. 2020.

Abstract

The ever-growing availability of computing power and the sustained development of advanced computational methods have contributed much to recent scientific progress. These developments present new challenges driven by the sheer number of calculations and volume of data to manage. Next-generation exascale supercomputers will intensify these challenges, making automated and scalable solutions crucial. In recent years, we have been developing AiiDA (aiida.net), a robust open-source high-throughput infrastructure addressing the challenges arising from the needs of automated workflow management and data provenance recording. Here, we introduce developments and capabilities required to reach sustained performance, with AiiDA supporting throughputs of tens of thousands of processes per hour, while automatically preserving and storing the full data provenance in a relational database, making it queryable and traversable and thus enabling high-performance data analytics. AiiDA's workflow language provides advanced automation, error handling features and a flexible plugin model to allow interfacing with external simulation software. The associated plugin registry enables seamless sharing of extensions, empowering a vibrant user community dedicated to making simulations more robust, user-friendly and reproducible.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
Schematic overview of the architecture of AiiDA 1.0.

**Fig. 2**
(a) A schematic provenance graph representing the execution of a workflow W₁ receiving three data nodes D₁, D₂ and D₃ as input, containing the values x, y and z respectively. W₁ computes the expression (x + y) · z by calling two calculations, C₁ (to perform the sum) and C₂ (to perform the product), forwarding the correct inputs to them. C₁ creates the intermediate node D₄ (with the value x + y) and C₂ then creates the node D₅ with the final result, which is then also returned by W₁. While this simplified example is purely for illustrative purposes, it demonstrates that by storing execution information as a graph, the provenance of all data is fully recorded. (b) Data-provenance layer: it includes calculation and data nodes only, showing the exact sequence of steps that led to the creation of the data nodes. (c) Logical-provenance layer: it hides the details of all intermediate results and focuses only on how the workflow produced the final results from a given set of inputs.
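The example in this caption can be sketched in plain Python. This is only an illustration of the idea of recording execution as a graph, not AiiDA's own API: the `Graph` class, the `run_workflow` helper and the link-type strings (`input_work`, `input_calc`, `create`, `call_calc`, `return`) are names assumed here for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """A toy provenance graph: typed nodes plus typed, directed links."""
    nodes: dict = field(default_factory=dict)   # label -> (kind, value)
    links: list = field(default_factory=list)   # (source, target, link_type)

    def add(self, label, kind, value=None):
        self.nodes[label] = (kind, value)

    def link(self, source, target, link_type):
        self.links.append((source, target, link_type))


def run_workflow(x, y, z):
    """Compute (x + y) * z while recording every step in a Graph."""
    g = Graph()
    for label, value in (("D1", x), ("D2", y), ("D3", z)):
        g.add(label, "data", value)
    g.add("W1", "workflow")
    for label in ("D1", "D2", "D3"):
        g.link(label, "W1", "input_work")

    # C1 performs the sum and creates the intermediate node D4.
    g.add("C1", "calculation")
    g.link("W1", "C1", "call_calc")
    g.link("D1", "C1", "input_calc")
    g.link("D2", "C1", "input_calc")
    g.add("D4", "data", x + y)
    g.link("C1", "D4", "create")

    # C2 performs the product and creates the final node D5.
    g.add("C2", "calculation")
    g.link("W1", "C2", "call_calc")
    g.link("D4", "C2", "input_calc")
    g.link("D3", "C2", "input_calc")
    g.add("D5", "data", g.nodes["D4"][1] * z)
    g.link("C2", "D5", "create")
    g.link("W1", "D5", "return")
    return g


def data_provenance(g):
    """Links between calculations and data only (panel b)."""
    return [l for l in g.links if l[2] in ("input_calc", "create")]


def logical_provenance(g):
    """Links between the workflow and its inputs/outputs only (panel c)."""
    return [l for l in g.links if l[2] in ("input_work", "return")]
```

Filtering the same link list by type is what separates the two layers in this sketch: the data-provenance view keeps the intermediate node D4, while the logical view shows only the workflow's inputs and its returned result.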

**Fig. 3**
The hierarchy of the node types in AiiDA. This hierarchy is mirrored in the Python code, where each node type is represented by a Python class and the hierarchy by Python's inheritance model. The different node classes allow custom functionality to be implemented for each subtype. Additionally, the subclass hierarchy makes it possible to query for specific node types, or a set thereof.

**Fig. 4**
Link types allowed in the AiiDA provenance graph. Rectangles represent node types and arrows connecting them indicate the direction and the type of each link. The symbols at the start and end of each arrow indicate the cardinality of the corresponding link types: 0..1 means that at most one node is allowed on that link endpoint for a given node on the opposite endpoint (for instance, a Data node can have at most one CalculationNode as its creator); 0..* means that any number of nodes is possible (for instance, a CalculationNode can have an arbitrary number of input Data nodes). Additionally, a dagger (†) indicates that link labels must be unique for a given node on the opposite endpoint (for instance, outgoing create links from a CalculationNode must have unique labels).
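The two constraints used as examples in this caption can be sketched as a small validator. This is a minimal illustration under assumptions, not AiiDA's implementation: the `ProvenanceLinks` class, its method names and the `LinkError` exception are hypothetical, and only the rules spelled out in the caption are enforced.

```python
class LinkError(ValueError):
    """Raised when a link would violate a cardinality or label rule."""


class ProvenanceLinks:
    def __init__(self):
        self.creator_of = {}      # data node -> the calculation that created it
        self.create_labels = {}   # calculation -> set of create-link labels used
        self.inputs_of = {}       # calculation -> list of (data node, label)

    def add_input(self, data, calc, label):
        # 0..*: a calculation may have any number of input data nodes.
        self.inputs_of.setdefault(calc, []).append((data, label))

    def add_create(self, calc, data, label):
        # 0..1: a data node can have at most one creator.
        if data in self.creator_of:
            raise LinkError(f"{data} already created by {self.creator_of[data]}")
        # Dagger rule: create-link labels must be unique per calculation.
        labels = self.create_labels.setdefault(calc, set())
        if label in labels:
            raise LinkError(f"{calc} already has a create link labelled {label!r}")
        labels.add(label)
        self.creator_of[data] = calc
```

Checking the constraints at the moment a link is added, rather than afterwards, keeps an invalid graph from ever being stored in this sketch.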

**Fig. 5**
(a) Schematic of an AiiDA graph that could result from a materials science simulation: as described by the labels, a Density Functional Theory self-consistent field (SCF) calculation and a geometry relaxation of a crystal structure, and a calculation of the “distance” between the initial and final structure. Orange squares represent nodes of type CalculationNode, circles represent Data nodes: blue for crystal structures (of type StructureData) and green for nodes of type Dict (dictionaries of key–value pairs with input parameters or parsed results). (b) Representation of a graph query searching for a StructureData node that was an input of a CalculationNode that created a Dict node as output. Labels on the right represent the filter on the node type applied while querying. (c) The four subgraphs that match the query embedded in the entire provenance graph, where the matching nodes and links are colored in red and highlighted by a surrounding border.
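The query in panel (b) amounts to matching a path of node types along directed links. The sketch below shows that idea in plain Python on a small graph loosely modelled on the caption. The node names, the exact set of links and the `match_path` helper are assumptions made for illustration, and this is not AiiDA's query interface.

```python
# Node label -> node type (type names taken from the caption).
nodes = {
    "s1": "StructureData", "s2": "StructureData",
    "p_scf": "Dict", "p_relax": "Dict",
    "scf": "CalculationNode", "relax": "CalculationNode", "dist": "CalculationNode",
    "out_scf": "Dict", "out_relax": "Dict", "out_dist": "Dict",
}

# Directed links, source -> target (inputs into calculations, outputs out of them).
edges = [
    ("s1", "scf"), ("p_scf", "scf"), ("scf", "out_scf"),
    ("s1", "relax"), ("p_relax", "relax"), ("relax", "s2"), ("relax", "out_relax"),
    ("s1", "dist"), ("s2", "dist"), ("dist", "out_dist"),
]


def match_path(nodes, edges, types):
    """Return every linked node path whose node types equal `types` in order."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)
    # Start from all nodes of the first type, then extend one link at a time.
    paths = [[n] for n, kind in nodes.items() if kind == types[0]]
    for wanted in types[1:]:
        paths = [
            path + [nxt]
            for path in paths
            for nxt in successors.get(path[-1], [])
            if nodes[nxt] == wanted
        ]
    return [tuple(path) for path in paths]


matches = match_path(nodes, edges, ["StructureData", "CalculationNode", "Dict"])
```

On this hypothetical graph the query yields four (structure, calculation, dictionary) triples: three starting from `s1` and one starting from `s2` through the distance calculation.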

**Fig. 6**
Comparison, on a logarithmic scale, of the space requirements and time to solution when querying data with the two AiiDA ORM backends. (a) Space needed to store 10000 structure data objects as raw text files, using the existing EAV-based schema and the new JSON-based schema. The reduced space requirements of the JSON-based schema with respect to the raw text files are due to, among other things, white-space removal. The JSONB schema reduces the required space by a factor of 1.5 compared to the raw file size and by a factor of 25 compared to the EAV-based schema. (b) Time for three different queries that return attributes of different size for the same set of nodes. The benchmarks are run on a cold database, meaning that the database caches are emptied before each query. We indicate separately the database query time (SQL time) and the total query time, which also includes the construction of the Python objects in memory. The total query time for the *site* attributes in the JSONB format is 75 times shorter than for the equivalent query in the EAV format. The SQL time for the same query is roughly 6.5 times shorter for the JSONB version of the SQL query than for the EAV version.
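To make the difference between the two storage layouts concrete, the sketch below flattens a nested attribute dictionary into entity-attribute-value (EAV) rows and, alternatively, serialises it as one compact JSON document. It is a simplified illustration under assumptions: the `to_eav_rows` helper, the dotted key convention and the sample attributes are invented here and do not reproduce AiiDA's actual schemas.

```python
import json


def to_eav_rows(node_id, value, prefix=""):
    """Flatten nested dicts/lists into (entity, attribute, value) rows."""
    rows = []
    if isinstance(value, dict):
        for key, item in value.items():
            rows += to_eav_rows(node_id, item, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, item in enumerate(value):
            rows += to_eav_rows(node_id, item, f"{prefix}.{index}" if prefix else str(index))
    else:
        # Leaf value: one row per scalar.
        rows.append((node_id, prefix, value))
    return rows


# Hypothetical attributes of a small structure-like node.
attributes = {
    "cell": [[1, 0], [0, 1]],
    "sites": [{"kind": "Si", "position": [0.0, 0.0]}],
}

eav_rows = to_eav_rows(42, attributes)

# JSON layout: the whole attribute tree is one value, stored without white space.
json_document = json.dumps(attributes, separators=(",", ":"))
```

In the EAV layout every scalar becomes its own row, so reading the attributes back requires collecting and reassembling many rows, whereas the JSON layout stores and returns the tree as a single value.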

**Fig. 7**
Analysis of a sample of one million nodes of the AiiDA graph published in ref. . (a) Frequencies of the number of ancestors and descendants of all nodes. (b) Frequencies of the number of hops, i.e., the distance to reach the farthest ancestor/descendant. (c) Required CPU time when querying for all descendants of 50 top-level nodes in a graph that consists of a number of binary trees of breadth B and depth D using the transitive closure on-the-fly (TC-OTF, diamonds) or the explicitly tabulated transitive closure (TC-TAB, squares).
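The two strategies compared in panel (c) can be illustrated in plain Python: computing descendants by traversing the graph at query time, versus looking them up in a precomputed table of (ancestor, descendant) pairs. This is a sketch of the general idea only. The function names and the single complete binary tree used as test graph are assumptions, and the benchmark in the figure concerns database queries rather than in-memory Python.

```python
from collections import deque


def binary_tree(depth):
    """Children map of a complete binary tree with `depth` levels of links; node 0 is the root."""
    return {i: [2 * i + 1, 2 * i + 2] for i in range(2 ** depth - 1)}


def descendants_on_the_fly(children, root):
    """Traverse the graph breadth-first at query time."""
    seen = set()
    queue = deque(children.get(root, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(children.get(node, []))
    return seen


def tabulate_closure(children):
    """Precompute every (ancestor, descendant) pair once."""
    all_nodes = set(children) | {c for cs in children.values() for c in cs}
    return {(a, d) for a in all_nodes for d in descendants_on_the_fly(children, a)}


def descendants_tabulated(table, root):
    """Answer the same question by filtering the precomputed table."""
    return {d for a, d in table if a == root}


tree = binary_tree(3)
table = tabulate_closure(tree)
```

The trade-off in this sketch is the usual one: the table makes each lookup a simple filter but has to be stored and kept up to date as the graph grows, while the on-the-fly traversal stores nothing extra and pays the cost at query time.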

**Fig. 8**
Process submission and completion rates for the old and new engine. (a) Number of submitted (solid lines) and completed (dashed lines) processes over time for the new engine (both with optimised parameters and with artificial constraints, see text) and the old engine. Submission with the old engine is slightly faster, but the completion rate of the new engine is nevertheless clearly higher, even under constrained conditions. (b) Number of completed processes for the old (dashed lines) and new (solid lines) engine, decomposed into the separate (sub)processes. The polling-based nature of the old engine is clearly reflected in the stepwise behaviour of the completion rate, with processes being finalised in batches. In contrast, the curves for the new engine, due to its event-based design, are smooth and closely packed together, indicating that processes are executed in a continuous fashion.
