NPJ Comput Mater. 2023;9(1):222.
doi: 10.1038/s41524-023-01171-9. Epub 2023 Dec 13.

A rule-free workflow for the automated generation of databases from scientific literature

Luke P J Gilligan et al. NPJ Comput Mater. 2023.

Abstract

In recent times, transformer networks have achieved state-of-the-art performance in a wide range of natural language processing tasks. Here we present a workflow based on the fine-tuning of BERT models for different downstream tasks, which results in the automated extraction of structured information from unstructured natural language in scientific literature. Contrary to existing methods for the automated extraction of structured compound-property relations from similar sources, our workflow does not rely on the definition of intricate grammar rules. Hence, it can be adapted to a new task without requiring extensive implementation efforts and knowledge. We test our data-extraction workflow by automatically generating a database for Curie temperatures and one for band gaps. These are then compared with manually curated datasets and with those obtained with a state-of-the-art rule-based method. Furthermore, in order to showcase the practical utility of the automatically extracted data in a material-design workflow, we employ them to construct machine-learning models to predict Curie temperatures and band gaps. In general, we find that, although more noisy, automatically extracted datasets can grow fast in volume and that such volume partially compensates for the inaccuracy in downstream tasks.

Keywords: Computational methods; Electronic structure.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Schematic diagram of the BERT-PSIE pipeline for the automated extraction of compound-property pairs from the scientific literature.
The workflow relies on the combination of BERT models fine-tuned for different downstream tasks such as sentence classification, named entity recognition, and relation classification. Here we use the Curie temperature as an example. See text for more details.
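The three stages named in this caption can be chained as in the following minimal sketch. It only illustrates the data flow: the three callables (`toy_is_relevant`, `toy_tag_entities`, `toy_is_related`) are hypothetical stand-ins for the fine-tuned BERT models and use toy keyword and regex logic so that the example runs without any model weights. None of these names, nor the example sentences, come from the paper.

```python
import re
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (label, text), e.g. ("COMPOUND", "Fe3O4")


def extract_pairs(
    sentences: List[str],
    is_relevant: Callable[[str], bool],
    tag_entities: Callable[[str], List[Entity]],
    is_related: Callable[[str, str, str], bool],
) -> List[Tuple[str, str]]:
    """Chain sentence classification, NER and relation classification."""
    pairs = []
    for sentence in sentences:
        if not is_relevant(sentence):        # stage 1: sentence classification
            continue
        entities = tag_entities(sentence)    # stage 2: named entity recognition
        compounds = [text for label, text in entities if label == "COMPOUND"]
        values = [text for label, text in entities if label == "VALUE"]
        for compound in compounds:           # stage 3: relation classification
            for value in values:
                if is_related(sentence, compound, value):
                    pairs.append((compound, value))
    return pairs


# Toy stand-ins (assumptions for illustration only, NOT the paper's models).
def toy_is_relevant(sentence: str) -> bool:
    return "Curie temperature" in sentence


def toy_tag_entities(sentence: str) -> List[Entity]:
    # Very rough patterns: a formula is two or more element-like tokens,
    # a value is a number followed by the unit K.
    compounds = re.findall(r"\b(?:[A-Z][a-z]?\d*){2,}\b", sentence)
    values = re.findall(r"\b\d+(?:\.\d+)? K\b", sentence)
    return [("COMPOUND", c) for c in compounds] + [("VALUE", v) for v in values]


def toy_is_related(sentence: str, compound: str, value: str) -> bool:
    return True  # a real relation classifier would score each candidate pair


sentences = [
    "The Curie temperature of Fe3O4 is 858 K.",
    "Samples were annealed at 900 K for two hours.",
]
pairs = extract_pairs(sentences, toy_is_relevant, toy_tag_entities, toy_is_related)
print(pairs)
```

In this sketch the second sentence is discarded at the first stage, so its temperature never reaches the relation step. Swapping each toy callable for a fine-tuned model would leave `extract_pairs` unchanged.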
Fig. 2. Comparison between the content of the different databases: (red box) BERT-PSIE, (blue box) ChemDataExtractor, and (green box) the manually extracted database.
(a) Normalized distribution of the extracted Curie temperatures. A peak at ~300 K is visible in the distributions of both automatically extracted databases but not in the manually extracted one. (b) Relative elemental abundance across the compounds present in each database. Although the three databases are in general agreement, the automatically extracted data show additional peaks for various elements that are absent from the manually curated dataset. The most severe of these discrepancies is in the relative abundance of Mn- and O-containing compounds. Note that the automatically extracted datasets and the manually curated one are based on different literature libraries.
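One way to compute a relative elemental abundance of the kind plotted in panel b is sketched below. The definition used here (the fraction of compounds in a database that contain a given element) is an assumption, as is the simple regex-based formula parser, which only looks for element-like symbols and does not validate them. The formulas are illustrative and not taken from any of the databases.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# An element symbol: one uppercase letter, optionally followed by one lowercase.
ELEMENT = re.compile(r"[A-Z][a-z]?")


def elemental_abundance(formulas: Iterable[str]) -> Dict[str, float]:
    """Fraction of formulas that contain each element (assumed definition)."""
    formulas = list(formulas)
    counts = Counter()
    for formula in formulas:
        # Count each element once per compound, regardless of stoichiometry.
        counts.update(set(ELEMENT.findall(formula)))
    return {element: n / len(formulas) for element, n in counts.items()}


abundance = elemental_abundance(["Fe3O4", "MnO", "La0.7Sr0.3MnO3", "Ni"])
print(abundance["O"], abundance["Mn"])
```

Comparing such dictionaries across databases would expose element-specific over-representation like the Mn and O discrepancy noted in the caption.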
Fig. 3. Comparison between the Curie temperature (TC) distributions of compounds containing the most common elements found in ferromagnets.
The violin plots display the TC distribution of the compounds containing specific elements in the dataset automatically generated with BERT-PSIE (red) and ChemDataExtractor (blue), and in the manually curated ground truth (green). Only the most common elements appearing in the datasets are displayed here. The dots show the median of each distribution.
Fig. 4. The distribution of band-gap values for the five most common chemical formulas found in the database of band gaps generated by BERT-PSIE.
The histograms report the relative abundance, while dashed lines indicate gap energies corresponding to specific experimental measurements or theoretical calculations.
Fig. 5. Query-test for the BERT-PSIE-generated TC dataset.
(a) Comparison between the TC queried in the dataset automatically generated by BERT-PSIE and the values contained in the manually curated dataset. The comparison is performed over the 262 compounds that are shared by all datasets examined in this work. The median value is returned whenever multiple TC values are collected for a given compound. (b) The same comparison performed on the dataset obtained by combining the one generated by BERT-PSIE with the one generated by ChemDataExtractor.
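The query rule stated in this caption (return the median when a compound has several extracted TC values) can be sketched as follows. The function names `build_index` and `query_tc` and the records are illustrative assumptions, not part of the published workflow.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Optional, Tuple


def build_index(records: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group all extracted TC values (in K) by compound formula."""
    index = defaultdict(list)
    for compound, tc in records:
        index[compound].append(tc)
    return dict(index)


def query_tc(index: Dict[str, List[float]], compound: str) -> Optional[float]:
    """Return the median TC for a compound, or None if it was never extracted."""
    values = index.get(compound)
    return median(values) if values else None


# Illustrative records; the 300.0 entry mimics a spurious room-temperature value.
records = [("Fe3O4", 858.0), ("Fe3O4", 850.0), ("Fe3O4", 300.0), ("Ni", 627.0)]
index = build_index(records)
print(query_tc(index, "Fe3O4"), query_tc(index, "Ni"), query_tc(index, "Co"))
```

The median is less sensitive than the mean to isolated outliers, such as a single wrong value near 300 K, provided that most entries for a compound are correct.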
Fig. 6. Performance of a random-forest model for TC trained over automatically extracted databases.
Parity plot (predicted TC vs manually extracted TC) for the best random-forest (RF) compositional model constructed (a) on the BERT-PSIE dataset and (b) on the combined BERT-PSIE and ChemDataExtractor dataset. The test set consists of the 2623 compounds that are not present in any of the automatically generated datasets considered in this work but for which we have a TC manually extracted from the scientific literature.
Fig. 7. Evaluation of the RF models constructed on automatically extracted datasets as classifiers.
(a) Violin plots showing the TC distributions of the compounds screened using an RF model trained on the BERT-PSIE data, compared with the manually extracted values. The dashed line is the parity line, highlighting how the median of the screened distribution increases as the screening threshold increases. Despite a low recall, the precision is high enough to select compounds likely to have a TC higher than a given threshold. The screening is done on compounds not present in the training set of the RF. (b) The same test performed by training an RF model on the combination of the BERT-PSIE and ChemDataExtractor datasets.
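Using a regression model as a screening classifier, as in this caption, amounts to thresholding the predicted TC and scoring the selection against the reference values. A minimal sketch is given below, assuming that a compound counts as positive when its TC is strictly above the threshold. The function name `screening_metrics` and all numbers are illustrative, not results from the paper.

```python
from typing import Sequence, Tuple


def screening_metrics(
    predicted: Sequence[float],
    reference: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Precision and recall of selecting compounds with predicted TC > threshold."""
    selected = [p > threshold for p in predicted]   # what the model picks
    positive = [r > threshold for r in reference]   # what is truly above threshold
    true_positives = sum(s and p for s, p in zip(selected, positive))
    n_selected = sum(selected)
    n_positive = sum(positive)
    precision = true_positives / n_selected if n_selected else 0.0
    recall = true_positives / n_positive if n_positive else 0.0
    return precision, recall


# Illustrative TC values in K for five held-out compounds.
predicted = [650.0, 320.0, 410.0, 280.0, 500.0]
reference = [700.0, 610.0, 350.0, 300.0, 620.0]
precision, recall = screening_metrics(predicted, reference, threshold=600.0)
print(precision, recall)
```

In this toy case the single selected compound is a true positive, while two other compounds above the threshold are missed, which mirrors the high-precision, low-recall regime described in the caption.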
