MaTableGPT: GPT-Based Table Data Extractor from Materials Science Literature

Gyeong Hoon Yi et al. Adv Sci (Weinh). 2025 Apr;12(16):e2408221. doi: 10.1002/advs.202408221. Epub 2025 Jan 24.

Abstract

Efficiently extracting data from tables in the scientific literature is pivotal for building large-scale databases. However, tables in materials science papers take highly diverse forms, so rule-based extraction is ineffective. To overcome this challenge, this study presents MaTableGPT, a GPT-based table data extractor for the materials science literature. MaTableGPT's key strategies are table data representation and table splitting for better GPT comprehension, and the filtering of hallucinated information through follow-up questions. When applied to a large volume of water splitting catalysis literature, MaTableGPT achieves an extraction accuracy (total F1 score) of up to 96.8%. Through comprehensive evaluation of GPT usage cost, labeling cost, and extraction accuracy for zero-shot, few-shot, and fine-tuning learning methods, the study presents a Pareto-front mapping in which few-shot learning is the most balanced solution, owing to both its high extraction accuracy (total F1 score >95%) and low cost (GPT usage cost of 5.97 US dollars and a labeling cost of 10 input/output paired examples). Statistical analyses of the database generated by MaTableGPT reveal valuable insights into the distribution of overpotentials and elemental utilization across the catalysts reported in the water splitting literature.
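
For illustration, the few-shot setup described above can be sketched in Python (a minimal sketch, not the authors' code; the prompt wording, JSON schema, model name, and example table are assumptions):

```python
# Minimal few-shot extraction sketch. The system prompt, JSON schema, and the
# single example pair are illustrative stand-ins for the paper's 10 labeled examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "You extract water-splitting catalyst records from a table given as TSV. "
    "Return a JSON list of objects with keys: catalyst, reaction, overpotential_mV, "
    "current_density_mA_cm2, electrolyte. Use null when a value is absent."
)

EXAMPLE_TSV = (
    "Catalyst\tReaction\tOverpotential (mV)\tj (mA cm-2)\tElectrolyte\n"
    "NiFe-LDH\tOER\t240\t10\t1 M KOH"
)
EXAMPLE_JSON = (
    '[{"catalyst": "NiFe-LDH", "reaction": "OER", "overpotential_mV": 240, '
    '"current_density_mA_cm2": 10, "electrolyte": "1 M KOH"}]'
)

def extract(table_tsv: str) -> str:
    """Ask the model to convert one (possibly split) table into the JSON template."""
    response = client.chat.completions.create(
        model="gpt-4",  # the paper uses GPT-4 for few-shot and zero-shot learning
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": EXAMPLE_TSV},
            {"role": "assistant", "content": EXAMPLE_JSON},
            {"role": "user", "content": table_tsv},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```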

Keywords: GPT; large language models; literature mining; machine learning; materials science; table data extraction; water splitting catalysis.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Overall workflow of MaTableGPT. 1) Step 1: collection of papers relevant to oxygen evolution reaction (OER) and excluding noise papers. 2) Step 2: identification of tables containing the catalytic performance data. 3) Step 3: table data representation to aid GPT's comprehension. 4) Step 4: GPT training with zero‐shot, few‐shot, and fine‐tuning approaches. 5) Step 5: follow‐up questions to reduce hallucinatory data. 6) Step 6: building the database using the pretrained MaTableGPT model.
Figure 2
Examples of various formats of tables reported in the materials science literature. a) Table with a multilayered header of four rows and merged cells.[ 24 ] b) Multi‐header table with sub‐headers named the hydrogen evolution reaction (HER) and OER, each with two rows detailing catalyst performance.[ 25 ] c) Table including a caption and its index. The yellow frames outside the table explain the meaning of the caption index, which is denoted as the Greek letter or abbreviation.[ 26 ] d) Transposed table with rows and columns reversed compared to standard tables with catalyst names written in the leftmost column.[ 27 ]
Figure 3
Table data representation from the HTML format to customized JSON or TSV formats for effective GPT comprehension. a) Conversion rules from the HTML tags to customized JSON or TSV formats. b–e) Examples of a (b) raw table and its representations in (c) HTML, (d) customized JSON format, and (e) customized TSV format. f) Example of the final outcome (in JSON) produced by the GPT predictions.
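
The HTML-to-TSV side of this representation step can be sketched as follows (an illustrative flattening with BeautifulSoup; simple colspan expansion is assumed and rowspan handling is omitted, so this is not the paper's exact conversion rule set):

```python
# Illustrative HTML-table-to-TSV flattening. Merged cells (colspan) are repeated so
# every column keeps an explicit value; rowspan handling is omitted for brevity.
from bs4 import BeautifulSoup

def html_table_to_tsv(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = []
        for cell in tr.find_all(["th", "td"]):
            text = cell.get_text(" ", strip=True)
            cells.extend([text] * int(cell.get("colspan", 1)))
        rows.append("\t".join(cells))
    return "\n".join(rows)
```
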
Figure 4
Processes and rationale behind table splitting. a) Table splitting rules for a typical table. b) Examples of atypical tables with header complexities and rules for table splitting therein.
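
A minimal row-wise splitting sketch, assuming a single header row (the paper's rules in Figure 4 also cover multi-row and atypical headers):

```python
# Split one TSV table into fragments that each keep the header, so every fragment
# remains self-describing when passed to GPT separately.
def split_table(tsv: str, header_rows: int = 1) -> list[str]:
    lines = tsv.splitlines()
    header, body = lines[:header_rows], lines[header_rows:]
    return ["\n".join(header + [row]) for row in body]
```
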
Figure 5
Schematic showing the GPT modeling process based on fine‐tuning, few‐shot, and zero‐shot learning schemes. The extracted databases obtained by the three learning methods are in JSON template format, and each database is subjected to conversational follow‐up questions to minimize hallucinations. In this schematic, customized TSV, instead of customized JSON, is chosen as an example for the GPT input format for clarity. GPT‐3.5 is used for fine‐tuning, and GPT‐4 is used for both few‐shot and zero‐shot learning since GPT‐4 is not yet available for fine‐tuning as of May 2024.
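
The follow-up-question step can be sketched as below; the question wording and the record format are assumptions, while the idea of asking the model to confirm its own extraction against the source table follows the caption:

```python
# Hallucination filter via a conversational follow-up question: keep a record only
# if the model confirms it against the source table. Prompt wording is illustrative.
import json
from openai import OpenAI

client = OpenAI()

def confirm_record(table_tsv: str, record: dict) -> bool:
    question = (
        "Here is a table in TSV format:\n" + table_tsv + "\n\n"
        "Does the following extracted record appear in the table exactly as stated? "
        "Answer only 'yes' or 'no'.\n" + json.dumps(record)
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")
```
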
Figure 6
Performance comparison of the MaTableGPT models with various parameters of input formats, learning methods, and table splitting. The input formats of customized TSV, customized JSON, and the baseline are considered. Baseline denotes the case of the original HTML format and table non‐splitting. The learning methods of fine‐tuning, few‐shot learning, and zero‐shot learning are considered. Non‐split refers to the table inputs that have not been split, while split denotes inputs where the table has been split. The metrics of the structure F1 score, value accuracy, and total F1 score are used.
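
For reference, a generic precision/recall/F1 computation over extracted (field, value) pairs is sketched below; the paper's exact definitions of structure F1 and value accuracy may differ:

```python
# F1 over sets of extracted (field, value) pairs: a generic illustration of the
# arithmetic behind the reported scores, not the paper's exact metric definitions.
def f1_score(predicted: set, gold: set) -> float:
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```
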
Figure 7
MaTableGPT accuracy‐cost map. A map providing an overview of GPT usage cost, labeling cost, and performance (total F1 score) across the various input formats and models. For the labeling cost, the size of the circle represents the relative size of the labeling set, with larger circles indicating a larger labeling size. For fine‐tuning, the GPT cost is calculated based on all tokens used in training and test inputs and outputs. For few‐shot learning, the cost includes tokens from the task description, 10‐shot examples, and their outputs. For zero‐shot learning and follow‐up questions, the cost includes tokens from the task description and outputs. For cases involving table splitting, the training set (used only for fine‐tuning) contains 1055 tables, while the test set contains 293 tables. For cases not involving table splitting, the training set for fine‐tuning contains 126 tables, and the test set contains 35 tables. Details of each GPT cost can be found in Note S5 (Supporting Information).
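
A token-based cost estimate of the kind described here can be sketched as follows; the per-1K-token prices are placeholders rather than the rates used in the paper:

```python
# Estimate GPT usage cost from token counts. Prices are placeholder values; consult
# current OpenAI pricing. tiktoken only counts tokens locally.
import tiktoken

def estimate_cost_usd(prompt: str, completion: str,
                      usd_in_per_1k: float = 0.03, usd_out_per_1k: float = 0.06) -> float:
    enc = tiktoken.encoding_for_model("gpt-4")
    n_in = len(enc.encode(prompt))
    n_out = len(enc.encode(completion))
    return n_in / 1000 * usd_in_per_1k + n_out / 1000 * usd_out_per_1k
```
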
Figure 8
Distribution of the overpotentials under different electrolyte conditions and different current densities. a) Distribution of the overpotentials for the OER and HER under different electrolyte conditions (acidic and alkaline). b) Distribution of the overpotentials for the OER and HER at different current densities (10 and 100 mA cm−2). The y-axis shows counts normalized by the total sum of counts.
Figure 9
Elemental utilization across different electrolyte environments. a) Heatmap of the most commonly used elements in OER catalysts in each acidic and alkaline medium. b–g) Visualization of the association rule mining (ARM) results. b) Entire graph of the ARM results for acidic media. c) Subgraph highlighting the nodes that are connected to Ir for acidic media. d) Subgraph highlighting the nodes that are connected to Ru for acidic media. e) Entire graph of the ARM results for alkaline media. f) Subgraph highlighting the nodes that are connected to Ni for alkaline media. g) Subgraph highlighting the nodes that are connected to Co for alkaline media.
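
The ARM analysis can be reproduced in spirit with a standard association-rule toolkit; the toy element sets and thresholds below are illustrative, not data from the paper:

```python
# Association rule mining over catalyst element sets (in the spirit of Figure 9),
# using mlxtend. The element lists and thresholds are toy values for illustration.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

catalyst_elements = [["Ir", "O"], ["Ru", "O"], ["Ni", "Fe", "O"], ["Ni", "Co", "O"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(catalyst_elements).transform(catalyst_elements),
                      columns=te.columns_)
frequent = apriori(onehot, min_support=0.25, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```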

References

    1. Himanen L., Geurts A., Foster A. S., Rinke P., Adv. Sci. 2019, 6, 1900808.
    2. Cole J. M., Acc. Chem. Res. 2020, 53, 599.
    3. Ramprasad R., Batra R., Pilania G., Mannodi‐Kanakkithodi A., Kim C., npj Comput. Mater. 2017, 3, 54.
    4. Jain A., Hautier G., Ong S. P., Persson K., J. Mater. Res. 2016, 31, 977.
    5. Jain A., Ong S. P., Hautier G., Chen W., Richards W. D., Dacek S., Cholia S., Gunter D., Skinner D., Ceder G., APL Mater. 2013, 1, 011002.
