Sci Data. 2020 May 1;7(1):134.
doi: 10.1038/s41597-020-0473-z.

The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules

Justin S Smith et al. Sci Data. 2020.

Abstract

Maximum diversification of data is a central theme in building generalized and accurate machine learning (ML) models. In chemistry, ML has been used to develop models for predicting molecular properties, for example quantum mechanics (QM) calculated potential energy surfaces and atomic charge models. The ANI-1x and ANI-1ccx ML-based general-purpose potentials for organic molecules were developed through active learning, an automated data diversification process. Here, we describe the ANI-1x and ANI-1ccx data sets. To demonstrate data diversity, we visualize it with a dimensionality reduction scheme and contrast it with existing data sets. The ANI-1x data set contains multiple QM properties from 5 M density functional theory calculations, while the ANI-1ccx data set contains 500 k data points obtained with an accurate CCSD(T)/CBS extrapolation. Approximately 14 million CPU core-hours were expended to generate this data. Multiple QM calculated properties for the chemical elements C, H, N, and O are provided: energies, atomic forces, multipole moments, atomic charges, etc. We provide this data to the community to aid research and development of ML models for chemistry.
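
The abstract names active learning but does not spell out the selection criterion, so the following is only a minimal sketch of a generic uncertainty-driven loop and not the authors' algorithm. Everything in it is an assumption made for illustration: the one-dimensional pool of candidates, the `oracle` function standing in for an expensive QM calculation, the bootstrap ensemble of nearest-neighbor predictors, and the `threshold` on ensemble disagreement.

```python
import random
import statistics


def oracle(x):
    """Hypothetical stand-in for an expensive reference (QM) calculation."""
    return x * x


def nearest_label(train, x):
    """Predict with the label of the closest training point (1-NN)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def ensemble_spread(labeled, x, rng, n_models=5):
    """Std. dev. of predictions from models fit on bootstrap resamples."""
    preds = []
    for _ in range(n_models):
        sample = [rng.choice(labeled) for _ in labeled]  # bootstrap resample
        preds.append(nearest_label(sample, x))
    return statistics.pstdev(preds)


def active_learning(pool, seed_points, threshold, rounds, seed=0):
    """Label only the pool points on which the ensemble disagrees.

    Returns (labeled, remaining): labeled is a list of (x, y) pairs,
    remaining is the list of pool points that were never labeled.
    """
    rng = random.Random(seed)
    labeled = [(x, oracle(x)) for x in seed_points]
    remaining = [x for x in pool if x not in seed_points]
    for _ in range(rounds):
        # Select candidates whose ensemble disagreement exceeds the threshold.
        picked = [x for x in remaining
                  if ensemble_spread(labeled, x, rng) > threshold]
        if not picked:
            break  # the ensemble agrees everywhere; nothing new to learn
        labeled.extend((x, oracle(x)) for x in picked)
        remaining = [x for x in remaining if x not in picked]
    return labeled, remaining


if __name__ == "__main__":
    pool = [i / 10 for i in range(-20, 21)]
    labeled, remaining = active_learning(
        pool, [-2.0, 0.0, 2.0], threshold=0.1, rounds=3)
    print(len(labeled), "labeled;", len(remaining), "left unlabeled")
```

The point of the sketch is the control flow: reference calculations are spent only where the current models are uncertain, which is one common way to diversify a data set automatically.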

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Active learning schemes for building ANI data sets. (a) The active learning algorithm employed during the construction of the ANI-1x data set. (b) The ANI-1ccx selection and data generation scheme.
Fig. 2
2D parametric t-SNE embeddings. These embeddings are for the 1st layer of activations of the ANI-1x model for the complete QM9 data set and random subsets of the ANI-1, ANI-1x and ANI-1ccx data sets. The same number of atoms is compared for each element. The different colors correspond to the number and type of bonded neighbors.
Fig. 3
ANI data set energy and size distribution. (a) A histogram of the potential energies in the ANI-1x and ANI-1ccx data sets with a linear fit per atomic element Es removed. The bin width is 1 millihartree. (b) A histogram of the total number of atoms (including C, H, N, and O atoms) per molecule in the ANI-1x and ANI-1ccx data sets. The bin width is one.
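
Removing a linear fit per atomic element, as mentioned in the Fig. 3 caption, can be read as a least-squares fit of each total energy to the molecule's atom counts, followed by subtraction of the fitted part. The sketch below shows one way to do this under stated assumptions: the function names are hypothetical, the fit has no intercept term, and the normal equations are solved with plain Gaussian elimination. It is an illustration of the idea, not the procedure used to produce the figure.

```python
ELEMENTS = ("C", "H", "N", "O")


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: atom counts are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_element_energies(counts, energies):
    """Least-squares per-element energies via the normal equations.

    counts[m] holds the number of C, H, N, O atoms in molecule m,
    in the order given by ELEMENTS; energies[m] is its total energy.
    """
    n = len(ELEMENTS)
    ata = [[sum(c[i] * c[j] for c in counts) for j in range(n)]
           for i in range(n)]
    atb = [sum(c[i] * e for c, e in zip(counts, energies)) for i in range(n)]
    return solve_linear_system(ata, atb)


def remove_linear_fit(counts, energies):
    """Return each energy minus the fitted sum of per-element contributions."""
    eps = fit_element_energies(counts, energies)
    return [e - sum(ci * ei for ci, ei in zip(c, eps))
            for c, e in zip(counts, energies)]


if __name__ == "__main__":
    # Atom counts (C, H, N, O) for CH4, H2O, NH3, CH2O, HCN.
    counts = [(1, 4, 0, 0), (0, 2, 0, 1), (0, 3, 1, 0), (1, 2, 0, 1), (1, 1, 1, 0)]
    # Invented energies for illustration only; not values from the data sets.
    energies = [-40.5, -76.4, -56.5, -114.5, -93.4]
    print(remove_linear_fit(counts, energies))
```

What remains after the subtraction is the part of the energy that depends on structure rather than composition, which is why a histogram of these residuals is more informative than one of raw total energies.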
