Sci Data. 2022 Feb 14;9(1):31.
doi: 10.1038/s41597-021-01109-0.

A FAIR and AI-ready Higgs boson decay dataset

Yifan Chen et al. Sci Data.

Abstract

To enable the reusability of massive scientific datasets by humans and machines, researchers aim to adhere to the principles of findability, accessibility, interoperability, and reusability (FAIR) for data and artificial intelligence (AI) models. This article provides a domain-agnostic, step-by-step assessment guide to evaluate whether or not a given dataset meets these principles. We demonstrate how to use this guide to evaluate the FAIRness of an open simulated dataset produced by the CMS Collaboration at the CERN Large Hadron Collider. This dataset consists of Higgs boson decays and quark and gluon background, and is available through the CERN Open Data Portal. We use additional available tools to assess the FAIRness of this dataset, and incorporate feedback from members of the FAIR community to validate our results. This article is accompanied by a Jupyter notebook to visualize and explore this dataset. This study marks the first in a planned series of articles that will guide scientists in the creation of FAIR AI models and datasets in high energy particle physics.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
The distribution of labels is shown for a representative file in the training dataset.
Fig. 2
Illustration of an H(bb̄) jet with two secondary vertices (SVs) from the decay of two b hadrons resulting in charged-particle tracks (including a low-energy, or soft, lepton) that are displaced with respect to the primary collision vertex (PV), and hence with a large impact parameter (IP) value.
Fig. 3
The distributions of some salient jet features: (a) the soft-drop jet mass; (b) number of particle candidates; (c) number of secondary vertices; and (d) number of tracks, are shown for one file in the training dataset.
