A rule-free workflow for the automated generation of databases from scientific literature

Luke P J Gilligan^#¹, Matteo Cobelli^#¹, Valentin Taufour², Stefano Sanvito¹

Affiliations

¹ School of Physics, AMBER and CRANN Institute, Trinity College, Dublin 2, Dublin, Ireland.
² Department of Physics and Astronomy, University of California, Davis, CA 95616 USA.

^# Contributed equally.

PMID: 38666056
PMCID: PMC11041762
DOI: 10.1038/s41524-023-01171-9

Luke P J Gilligan et al. NPJ Comput Mater. 2023;9(1):222. doi: 10.1038/s41524-023-01171-9. Epub 2023 Dec 13.

Abstract

In recent times, transformer networks have achieved state-of-the-art performance in a wide range of natural language processing tasks. Here we present a workflow based on the fine-tuning of BERT models for different downstream tasks, which results in the automated extraction of structured information from unstructured natural language in scientific literature. Contrary to existing methods for the automated extraction of structured compound-property relations from similar sources, our workflow does not rely on the definition of intricate grammar rules. Hence, it can be adapted to a new task without requiring extensive implementation efforts and knowledge. We test our data-extraction workflow by automatically generating a database for Curie temperatures and one for band gaps. These are then compared with manually curated datasets and with those obtained with a state-of-the-art rule-based method. Furthermore, in order to showcase the practical utility of the automatically extracted data in a material-design workflow, we employ them to construct machine-learning models to predict Curie temperatures and band gaps. In general, we find that, although more noisy, automatically extracted datasets can grow fast in volume and that such volume partially compensates for the inaccuracy in downstream tasks.

Keywords: Computational methods; Electronic structure.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

**Fig. 1. Schematic diagram of the BERT-PSIE pipeline for the automated extraction of compound-property pairs from the scientific literature.**
The workflow relies on the combination of BERT models fine-tuned for different downstream tasks such as sentence classification, named entity recognition, and relation classification. Here we use the Curie temperature as an example. See text for more details.
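The three stages named in this caption can be chained as in the minimal sketch below. It is not the authors' implementation: every name (`extract_records`, `Record`, the three stage arguments) is hypothetical, and each stage is modelled as a plain Python callable standing in for a fine-tuned BERT model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical stage signatures. In the workflow described above, each stage
# would be a fine-tuned BERT model; here they are ordinary callables.
SentenceClassifier = Callable[[str], bool]
EntityTagger = Callable[[str], Tuple[List[str], List[str]]]  # (compounds, values)
RelationClassifier = Callable[[str, str, str], bool]


@dataclass(frozen=True)
class Record:
    """One extracted compound-property pair and the sentence it came from."""
    compound: str
    value: str
    sentence: str


def extract_records(
    sentences: Iterable[str],
    is_relevant: SentenceClassifier,
    tag_entities: EntityTagger,
    is_related: RelationClassifier,
) -> List[Record]:
    """Chain sentence classification, entity tagging, and relation classification."""
    records: List[Record] = []
    for sentence in sentences:
        # Stage 1: keep only sentences that mention the target property.
        if not is_relevant(sentence):
            continue
        # Stage 2: tag compound mentions and property values in the sentence.
        compounds, values = tag_entities(sentence)
        # Stage 3: keep only the compound-value pairs judged to be related.
        for compound in compounds:
            for value in values:
                if is_related(sentence, compound, value):
                    records.append(Record(compound, value, sentence))
    return records
```

Passing the stages in as arguments keeps the chaining logic independent of any particular model library, so a stage can be swapped or retrained without touching the loop.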

**Fig. 2. Comparison between the content of the different databases: (red box) BERT-PSIE, (blue box) ChemDataExtractor, and (green box) the manually extracted database of ref. .**
a Normalized distribution of the Curie temperatures extracted. A peak is visible at ~300 K in the distributions of both automatically extracted databases, which is not seen in the manually extracted one. b Relative elemental abundance across the compounds present in a database. Although there is general agreement among the three databases, additional peaks are observed for various elements in the case of automatically extracted data, which are not present in the manually curated dataset. The most severe of these discrepancies is in the relative abundance of Mn- and O-containing compounds. Note that the automatically extracted datasets and the manually curated one are based on different literature libraries.
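One way to compute a relative elemental abundance like the one in panel b is sketched below. It assumes the abundance of an element is the fraction of compounds whose formula contains it, which the caption does not state explicitly. The function name and the regular expression are illustrative; the pattern matches any capital letter followed by an optional lowercase letter and does not check symbols against the periodic table.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumption: an element symbol is one capital letter plus an optional
# lowercase letter. Symbols are not validated against the periodic table.
ELEMENT = re.compile(r"[A-Z][a-z]?")


def relative_abundance(formulas: Iterable[str]) -> Dict[str, float]:
    """Return, for each element, the fraction of formulas that contain it."""
    counts: Counter = Counter()
    total = 0
    for formula in formulas:
        total += 1
        # Use a set so an element counts once per compound.
        counts.update(set(ELEMENT.findall(formula)))
    if total == 0:
        return {}
    return {element: n / total for element, n in counts.items()}
```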

**Fig. 3. Comparison between the T_C distributions of compounds containing the most common elements found in ferromagnets.**
The violin plots display the T_C distribution of the compounds containing specific elements in the dataset automatically generated with BERT-PSIE (red) and ChemDataExtractor (blue), and in the manually curated ground truth (green). Only the most common elements appearing in the datasets are displayed here. The dots show the median of each distribution.

**Fig. 4. The distribution of band-gap values for the five most common chemical formulas found in the database of band-gaps generated by BERT-PSIE.**
The histograms report the relative abundance, while dashed lines indicate gap energies corresponding to specific experimental measurements or theoretical calculations.

**Fig. 5. Query-test for the BERT-PSIE-generated T_C dataset.**
a Comparison between the T_C queried in the dataset automatically generated by BERT-PSIE and the values contained in the manually curated dataset. The comparison is performed over the 262 compounds that are shared by all datasets examined in this work. The median value is returned whenever multiple T_C values are collected for a given compound. b The same comparison is performed on the dataset obtained by combining the one generated by BERT-PSIE with the one generated by ChemDataExtractor.
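The query rule stated in this caption (return the median when a compound has several extracted values) can be sketched as follows. The function names and the `(compound, T_C)` record layout are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Optional, Tuple


def build_index(records: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group every extracted T_C value (in K) under its compound."""
    index: Dict[str, List[float]] = defaultdict(list)
    for compound, tc in records:
        index[compound].append(tc)
    return dict(index)


def query_tc(index: Dict[str, List[float]], compound: str) -> Optional[float]:
    """Return the median T_C for a compound, or None if it was never extracted."""
    values = index.get(compound)
    return median(values) if values else None
```

Using the median rather than the mean limits the influence of a single badly extracted value when most entries for a compound agree.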

**Fig. 6. Performance of a random-forest model for T_C trained over automatically extracted databases.**
Parity plot (predicted T_C vs manually extracted T_C) for the best RF compositional model constructed a on the BERT-PSIE dataset and b on the combined BERT-PSIE and ChemDataExtractor dataset. The test set consists of the 2623 compounds that are not present in any of the automatically generated datasets considered in this work but for which we have a T_C manually extracted from the scientific literature.

**Fig. 7. Evaluation of the RF models constructed on automatically extracted datasets as classifiers.**
a Violin plots showing the T_C distributions of the compounds screened using an RF model trained on the BERT-PSIE data and compared with the manually extracted values. The dashed line is the parity line, highlighting how the median of the screened distribution increases as the screening threshold increases. Despite a low recall, the precision is high enough to select compounds likely to have a T_C higher than a given threshold. The screening is done on compounds not present in the training set of the RF. b The same test is performed by training an RF model on the combination of the BERT-PSIE and ChemDataExtractor datasets.
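The precision and recall referred to in this caption can be computed as in the sketch below, which treats the regression model as a classifier. It assumes a compound is "selected" when its predicted T_C exceeds the threshold and is a true positive when its manually extracted T_C also exceeds it. The function name is hypothetical, and returning 0.0 when a denominator is empty is a choice made for this example.

```python
from typing import Sequence, Tuple


def screening_precision_recall(
    predicted: Sequence[float],
    reference: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Precision and recall of selecting compounds with T_C above a threshold."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    # Selected: the model predicts a T_C above the threshold.
    selected = [p > threshold for p in predicted]
    # Positive: the reference (manually extracted) T_C is above the threshold.
    positive = [r > threshold for r in reference]
    true_positives = sum(s and p for s, p in zip(selected, positive))
    n_selected = sum(selected)
    n_positive = sum(positive)
    # Convention for this sketch: report 0.0 when a denominator is zero.
    precision = true_positives / n_selected if n_selected else 0.0
    recall = true_positives / n_positive if n_positive else 0.0
    return precision, recall
```

A low recall with high precision, as described above, means the screen misses many high-T_C compounds but the ones it does select are mostly correct.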

References

1. Bornmann L, Haunschild R, Mutz R. Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanit. Soc. Sci. Commun. 2021;8:224. doi: 10.1057/s41599-021-00903-w.
2. Curtarolo S, et al. Aflowlib.org: a distributed materials properties repository from high-throughput ab initio calculations. Comput. Mater. Sci. 2012;58:227–235. doi: 10.1016/j.commatsci.2012.02.002.
3. Talirz L, et al. Materials cloud, a platform for open computational science. Sci. Data. 2020;7:299. doi: 10.1038/s41597-020-00637-5.
4. Jain A, et al. Commentary: the materials project: a materials genome approach to accelerating materials innovation. APL Mater. 2013;1:011002. doi: 10.1063/1.4812323.
5. Kirklin S, et al. The open quantum materials database (OQMD): assessing the accuracy of DFT formation energies. npj Comput. Mater. 2015;1:15010. doi: 10.1038/npjcompumats.2015.10.
