J Appl Crystallogr. 2024 Jun 18;57(Pt 4):975-985.
doi: 10.1107/S1600576724004497. eCollection 2024 Aug 1.

Accurate space-group prediction from composition

Vishwesh Venkatraman et al. J Appl Crystallogr.

Abstract

Predicting crystal symmetry simply from chemical composition has remained challenging. Several machine-learning approaches can be employed, but the predictive value of popular crystallographic databases is relatively modest due to the paucity of data and uneven distribution across the 230 space groups. In this work, virtually all crystallographic information available to science has been compiled and used to train and test multiple machine-learning models. Composition-driven random-forest classification relying on a large set of descriptors showed the best performance. The predictive models for crystal system, Bravais lattice, point group and space group of inorganic compounds are made publicly available as easy-to-use software downloadable from https://gitlab.com/vishsoft/cosy.

Keywords: data sets; machine learning; prediction; random forests; space groups.

Figures

Figure 1
(a) Distribution of unique compounds in the databases. (b) Pairwise intersections. (c1) Element-containing compounds in each database, where O|H|C|N represents compounds containing at least one of these elements and /O/H/C/N represents compounds containing none of these elements. (c2) Magnified detail of (c1).
Figure 2
(a) Compound distribution across the crystal systems in all databases. (b) Distribution of compounds across the Bravais lattices and lattice centering types in MERGED. (c) Contribution of each primary database to each point-group class.
Figure 3
Distribution of unique compounds across the 230 space groups in the data sets.
Figure 4
Space-group distribution in MERGED. The labels over frequent classes indicate the space-group number followed by the corresponding percentage. The number of compounds for each space group is listed in Table S2 in the SI.
Figure 5
Top-k accuracies for the test set (averaged over three independent splits) obtained by the different ML approaches. Values in brackets indicate the number of classes associated with each response.
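The top-k accuracy reported in this caption counts a prediction as correct when the true class is among the k highest-ranked classes. A minimal sketch in plain Python follows; the function name, the dictionary-of-scores input format and the toy space-group labels are illustrative assumptions, not taken from the paper or its software.

```python
from typing import Dict, Hashable, Sequence


def top_k_accuracy(
    y_true: Sequence[Hashable],
    class_scores: Sequence[Dict[Hashable, float]],
    k: int,
) -> float:
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    if len(y_true) != len(class_scores):
        raise ValueError("y_true and class_scores must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    hits = 0
    for truth, scores in zip(y_true, class_scores):
        # Rank class labels by descending score and keep the k best.
        ranked = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += truth in ranked
    return hits / len(y_true)


# Toy data (hypothetical): three compounds scored over three space-group numbers.
y_true = [225, 62, 14]
scores = [
    {225: 0.7, 62: 0.2, 14: 0.1},  # true class ranked first
    {225: 0.5, 62: 0.3, 14: 0.2},  # true class ranked second
    {225: 0.6, 62: 0.3, 14: 0.1},  # true class ranked third
]

for k in (1, 2, 3):
    print(k, top_k_accuracy(y_true, scores, k))
```

Averaging over independent splits, as the caption describes, would amount to calling this function once per split and taking the mean of the results.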
Figure 6
Heatmap showing the per-class sensitivity and specificity of the RF model for the 172 space-group test set.
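Per-class sensitivity and specificity are usually computed one-vs-rest: for each class, sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The sketch below illustrates that standard definition in plain Python; the function name and the toy labels are assumptions made for illustration and do not reproduce the paper's data.

```python
from typing import Dict, List, Tuple


def per_class_sensitivity_specificity(
    y_true: List[int], y_pred: List[int]
) -> Dict[int, Tuple[float, float]]:
    """Return {class: (sensitivity, specificity)} using one-vs-rest counts."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    result = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        # Undefined ratios (no positives or no negatives) are reported as NaN.
        sensitivity = tp / (tp + fn) if tp + fn else float("nan")
        specificity = tn / (tn + fp) if tn + fp else float("nan")
        result[c] = (sensitivity, specificity)
    return result


# Toy labels (hypothetical): six compounds, three space-group numbers.
y_true = [225, 225, 62, 62, 14, 14]
y_pred = [225, 62, 62, 62, 14, 225]
metrics = per_class_sensitivity_specificity(y_true, y_pred)
for label, (sens, spec) in metrics.items():
    print(label, sens, spec)
```

With many classes of very different sizes, specificity tends to be close to 1 for rare classes because true negatives dominate, so sensitivity is often the more informative of the two values.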
Figure 7
Top-k accuracies of the RF-based models for two independent test sets: (i) the AMCSD (Downs & Hall-Wallace, 2003) data set containing 8253 compounds and (ii) the HEAC data set comprising 125 compounds.
Figure 8
Confusion matrices for symmetry prediction using RF-based models trained on all databases (top-1 accuracy). (a) Point groups of the compounds in AMCSD. (b) Space groups of the compounds in AMCSD (for visualization clarity the RF-based models were trained only on the top 46 space groups of each database). Additional details can be found in Figs. F3 and F4 in the SI.
Figure 9
Variable importance in the RF models for lattice centering, crystal system, Bravais lattice, point group and space group. For brevity, only the ten most influential variables are shown for each symmetry category. The length of each bar is a quantitative measure of the decrease in accuracy upon removal of the variable from the set of descriptors. A missing bar for a variable in a symmetry category indicates that the variable is not among the top-ten contributors. More information on each descriptor can be found in the SI.
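The idea of ranking a descriptor by the accuracy lost without it can be illustrated with permutation importance, a common stand-in in which a column is shuffled rather than removed. The sketch below is an assumption-laden illustration: the caption does not state the exact procedure used, and `toy_model`, the data and all function names are hypothetical.

```python
import random
from typing import Callable, List

Row = List[float]


def accuracy(model: Callable[[Row], int], X: List[Row], y: List[int]) -> float:
    """Fraction of rows for which the model returns the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(
    model: Callable[[Row], int],
    X: List[Row],
    y: List[int],
    n_repeats: int = 10,
    seed: int = 0,
) -> List[float]:
    """Mean decrease in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row: Row) -> int:
    """Hypothetical classifier that looks only at the first descriptor."""
    return int(row[0] > 0.5)


X = [[0.0, 3.0], [1.0, 1.0], [0.0, 2.0], [1.0, 5.0],
     [0.0, 4.0], [1.0, 0.0], [0.0, 7.0], [1.0, 6.0]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

Because the toy model ignores the second descriptor, shuffling that column never changes a prediction and its importance is exactly zero, while the first descriptor shows a positive drop in accuracy.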

References

    1. Aguiar, J. A., Gong, M. L. & Tasdizen, T. (2020). Comput. Mater. Sci. 173, 109409.
    2. Allahyari, Z. & Oganov, A. R. (2020). J. Phys. Chem. C, 124, 23867–23878.
    3. Alsaui, A., Alqahtani, S. M., Mumtaz, F., Ibrahim, A. G., Mohammed, A., Muqaibel, A. H., Rashkeev, S. N., Baloch, A. A. B. & Alharbi, F. H. (2022). Sci. Rep. 12, 1577.
    4. Arik, S. O. & Pfister, T. (2019). arXiv:1908.07442.
    5. Artstein, R. & Poesio, M. (2008). Comput. Linguist. 34, 555–596.