Preparing Distributed Computing Operations for the HL-LHC Era With Operational Intelligence

Alessandro Di Girolamo¹, Federica Legger², Panos Paparrigopoulos¹, Jaroslava Schovancová¹, Thomas Beermann³, Michael Boehler⁴, Daniele Bonacorsi⁵,⁶, Luca Clissa⁵,⁶, Leticia Decker de Sousa⁵,⁶, Tommaso Diotalevi⁵,⁶, Luca Giommi⁵,⁶, Maria Grigorieva⁷, Domenico Giordano¹, David Hohn⁴, Tomáš Javůrek¹, Stephane Jezequel⁸, Valentin Kuznetsov⁹, Mario Lassnig¹, Vasilis Mageirakos¹, Micol Olocco², Siarhei Padolski¹⁰, Matteo Paltenghi¹, Lorenzo Rinaldi⁵,⁶, Mayank Sharma¹, Simone Rossi Tisbeni¹¹, Nikodemas Tuckus¹²

Affiliations

¹ CERN, Geneva, Switzerland.
² INFN Turin, Torino, Italy.
³ Bergische Universitaet Wuppertal, Wuppertal, Germany.
⁴ Physikalisches Institut, Albert-Ludwigs-Universitaet Freiburg, Freiburg, Germany.
⁵ University of Bologna, Bologna, Italy.
⁶ INFN Bologna, Bologna, Italy.
⁷ Lomonosov Moscow State University, Moscow, Russia.
⁸ LAPP, Université Grenoble Alpes, Université Savoie Mont Blanc, CNRS/IN2P3, Annecy, France.
⁹ Cornell University, Ithaca, NY, United States.
¹⁰ Brookhaven National Laboratory, Upton, NY, United States.
¹¹ INFN-CNAF Bologna, Bologna, Italy.
¹² Vilnius University, Vilnius, Lithuania.

PMID: 35072060
PMCID: PMC8776639
DOI: 10.3389/fdata.2021.753409

Alessandro Di Girolamo et al. Front Big Data. 2022 Jan 7;4:753409. doi: 10.3389/fdata.2021.753409. eCollection 2021.

Abstract

As a joint effort of various communities involved in the Worldwide LHC Computing Grid, the Operational Intelligence project aims to increase the level of automation in computing operations and to reduce human interventions. The distributed computing systems currently deployed by the LHC experiments have proven to be mature and capable of meeting the experimental goals by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters, and operational teams are needed to manage such heterogeneous infrastructures efficiently. Under the scope of the Operational Intelligence project, experts from several areas have gathered to propose and work on "smart" solutions. Machine learning, data mining, log analysis, and anomaly detection are only some of the tools we have evaluated for our use cases. In this community study contribution, we report on the development of a suite of operational intelligence services covering various use cases: workload management, data management, and site operations.

Keywords: HL-LHC; ML; NLP; distributed computing operations; operational intelligence; resources optimization.

Copyright © 2022 Di Girolamo, Legger, Paparrigopoulos, Schovancová, Beermann, Boehler, Bonacorsi, Clissa, Decker de Sousa, Diotalevi, Giommi, Grigorieva, Giordano, Hohn, Javůrek, Jezequel, Kuznetsov, Lassnig, Mageirakos, Olocco, Padolski, Paltenghi, Rinaldi, Sharma, Tisbeni and Tuckus.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
**(A)** Time-to-start in minutes of HammerCloud functional test jobs used for auto-exclusion and re-inclusion into the ATLAS Grid WFMS. **(B)** Example histogram from 2020-11-24. Number of job shaping actions every 30 min (empty: increase; filled: decrease) of the parallel running test jobs.

**FIGURE 2**
Example of an error message cluster summary.

**FIGURE 3**
Time evolution of cluster 0: the plot shows the count of errors in bins of 10 min.
