RadBERT: Adapting Transformer-based Language Models to Radiology

An Yan et al. Radiol Artif Intell. 2022 Jun 15;4(4):e210258. doi: 10.1148/ryai.210258. eCollection 2022 Jul.

Abstract

Purpose: To investigate if tailoring a transformer-based language model to radiology is beneficial for radiology natural language processing (NLP) applications.

Materials and methods: This retrospective study presents a family of bidirectional encoder representations from transformers (BERT)-based language models adapted for radiology, named RadBERT. Transformers were pretrained with either 2.16 or 4.42 million radiology reports from U.S. Department of Veterans Affairs health care systems nationwide on top of four different initializations (BERT-base, Clinical-BERT, robustly optimized BERT pretraining approach [RoBERTa], and BioMed-RoBERTa) to create six variants of RadBERT. Each variant was fine-tuned for three representative NLP tasks in radiology: (a) abnormal sentence classification: models classified sentences in radiology reports as reporting abnormal or normal findings; (b) report coding: models assigned a diagnostic code to a given radiology report for five coding systems; and (c) report summarization: given the findings section of a radiology report, models selected key sentences that summarized the findings. Model performance was compared by bootstrap resampling with five intensively studied transformer language models as baselines: BERT-base, BioBERT, Clinical-BERT, BlueBERT, and BioMed-RoBERTa.

Results: For abnormal sentence classification, all models performed well (accuracies above 97.5 and F1 scores above 95.0). RadBERT variants achieved significantly higher scores than corresponding baselines when given only 10% or less of 12 458 annotated training sentences. For report coding, all variants outperformed baselines significantly for all five coding systems. The variant RadBERT-BioMed-RoBERTa performed the best among all models for report summarization, achieving a Recall-Oriented Understudy for Gisting Evaluation-1 score of 16.18 compared with 15.27 by the corresponding baseline (BioMed-RoBERTa, P < .004).

Conclusion: Transformer-based language models tailored to radiology improved performance on radiology NLP tasks compared with baseline transformer language models.

Supplemental material is available for this article. © RSNA, 2022. See also the commentary by Wiggins and Tejani in this issue.

Keywords: Informatics; Neural Networks; Transfer Learning; Translation; Unsupervised Learning.
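The summarization results above are reported in ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation). As a point of reference for how that metric behaves, here is a minimal sketch of ROUGE-1 as clipped unigram overlap, assuming simple lowercase whitespace tokenization (the paper's exact tokenization and the recall/F1 variant it reports are not specified in the abstract):

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """ROUGE-1 precision/recall/F1 via clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a candidate unigram matches at most as many
    # times as it occurs in the reference.
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"precision": p, "recall": r, "f1": f1}

# Example: a model summary vs. a (hypothetical) impression section.
scores = rouge1("no acute abnormality",
                "no acute cardiopulmonary abnormality")
```

In this toy example the candidate recovers three of the four reference unigrams, so recall is 0.75 and precision is 1.0; published ROUGE scores (such as the 16.18 above) are conventionally reported on a 0-100 scale.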


Conflict of interest statement

Disclosures of conflicts of interest: A.Y. No relevant relationships. J.M. No relevant relationships. X.L. No relevant relationships. J.D. No relevant relationships. E.Y.C. No relevant relationships. A.G. Department of Defense grant paid to UCSD (covers a small percentage of the author's salary). C.N.H. No relevant relationships.

Figures

Graphical abstract (caption identical to Figure 1).
Figure 1:
Overview of our study design, which includes pretraining and fine-tuning of RadBERT. (A) In pretraining, different weight initializations were considered to create variants of RadBERT. (B) The variants were fine-tuned for three important radiology natural language processing (NLP) tasks: abnormal sentence classification, report coding, and report summarization. The performance of RadBERT variants for these tasks was compared with a set of intensively studied transformer-based language models as baselines. (C) Examples of each task and how performance was measured. In the abnormality identification task, a sentence in a radiology report was considered “abnormal” if it reported an abnormal finding and “normal” otherwise. A human-annotated abnormality was considered ground truth to evaluate the performance of an NLP model. In the code classification task, models were expected to output diagnostic codes (eg, abdominal aortic aneurysm, Breast Imaging Reporting and Data System [BI-RADS], and Lung Imaging Reporting and Data System [Lung-RADS]) that match the codes given by human providers as the ground truth for a given radiology report. During report summarization, the models generated a short summary given the findings in a radiology report. Summary quality was measured by how similar it was to the impression section of the input report. AAA = abdominal aortic aneurysm, BERT = bidirectional encoder representations from transformers, RadBERT = BERT-based language model adapted for radiology, RoBERTa = robustly optimized BERT pretraining approach.
Figure 2:
Distribution histograms of (A) the length of the entire 1000 radiology reports evaluated in task 3 and (B) their impression sections. Density is the proportion of reports or impressions (ie, documents) with a certain length. Length is the number of words in a report or impression (words per document). (C) The ratio between one report's length and its corresponding impression mostly lies in the range from 2 to 6. The average number of words in a sentence in these reports is 11. The histograms may be used to estimate the number of sentences that need to be extracted from a report as a summary to match the impressions.
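The caption notes that these histograms can be used to estimate how many sentences to extract per report. A small sketch of that arithmetic, assuming the figure's stated values (report-to-impression length ratio mostly 2 to 6, about 11 words per sentence); the defaults below are illustrative, not taken from the paper's pipeline:

```python
def target_summary_sentences(report_words: int,
                             ratio: float = 4.0,
                             words_per_sentence: float = 11.0) -> int:
    """Estimate how many sentences to extract from a report so the
    summary length matches a typical impression section.

    impression length ~= report length / ratio, then convert words
    to sentences using the average sentence length.
    """
    impression_words = report_words / ratio
    return max(1, round(impression_words / words_per_sentence))

# A 220-word report with a midrange ratio of 4 suggests a
# ~55-word impression, i.e. about 5 extracted sentences.
k = target_summary_sentences(220)
```

The `max(1, ...)` floor simply ensures very short reports still yield at least one summary sentence.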
Figure 3:
Confusion matrices for report coding with two language models (BERT-base and RadBERT-RoBERTa) fine-tuned to assign diagnostic codes in two coding systems (Lung Imaging Reporting and Data System [Lung-RADS] and abnormal) (see Appendix E4 [supplement]). (A, B) The Lung-RADS dataset consisted of six categories: “incomplete,” “benign nodule appearance or behavior,” “probably benign nodule,” “suspicious nodule-a,” “suspicious nodule-b,” and “prior lung cancer,” denoted as numbers 1 to 6 in the figure. (C, D) The abnormal dataset also consisted of six categories: “major abnormality,” “no attn needed,” “major abnormality, physician aware,” “minor abnormality,” “possible malignancy,” “significant abnormality, attn needed,” and “normal.” The figures show that RadBERT-RoBERTa improved from BERT-base by better distinguishing code numbers 5 and 6 for Lung-RADS and making fewer errors for code number 1 of the abnormal dataset. BERT = bidirectional encoder representations from transformers, RadBERT = BERT-based language model adapted for radiology, RoBERTa = robustly optimized BERT pretraining approach.
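The confusion matrices in Figure 3 tabulate, for each ground-truth code, how often the model predicted each code. A minimal sketch of how such a matrix is built from paired label lists, assuming codes are 1-indexed as in the figure (the paper's actual evaluation code is not shown in the abstract):

```python
def confusion_matrix(true_codes, pred_codes, n_classes):
    """Build an n_classes x n_classes matrix where
    rows = ground-truth code and columns = predicted code."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_codes, pred_codes):
        m[t - 1][p - 1] += 1  # codes are 1-indexed in the figure
    return m

# Toy Lung-RADS example: one code-5 report misread as code 6.
m = confusion_matrix(true_codes=[5, 5, 6, 6],
                     pred_codes=[5, 6, 6, 6],
                     n_classes=6)
```

Off-diagonal mass between rows/columns 5 and 6 is exactly the kind of error the caption says RadBERT-RoBERTa reduced relative to BERT-base.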
Figure 4:
Examples of (A) three high-scoring predicted summarizations and (B) three low-scoring predicted summarizations by RadBERT–BioMed-RoBERTa, the best-performing RadBERT model, and their corresponding ground truths (the impression section of input reports). BERT = bidirectional encoder representations from transformers, RadBERT = BERT-based language model adapted for radiology, RoBERTa = robustly optimized BERT pretraining approach.
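The summarization task described above is extractive: the model selects key sentences from the findings section. As a toy stand-in for the transformer-based sentence scorer (which ranks sentences using fine-tuned RadBERT representations, not word counts), here is a frequency-based extractor illustrating the select-top-k-sentences structure; all names and the scoring heuristic are illustrative assumptions:

```python
from collections import Counter

def extract_summary(sentences, k=2):
    """Score each sentence by its average document-level word
    frequency, then keep the top-k sentences in original order.
    This is a toy stand-in for a learned sentence scorer."""
    freq = Counter(w for s in sentences for w in s.lower().split())

    def score(s):
        words = s.lower().split()
        return sum(freq[w] for w in words) / len(words)

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]),
                 reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # preserve report order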

Similar articles

Cited by

References

    1. Vaswani A , Shazeer N , Parmar N , et al. . Attention is all you need . In: Advances in Neural Information Processing Systems 30 (NIPS 2017) , 2017. ; 5998 – 6008 . https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-... .
    1. Devlin J , Chang MW , Lee K , Toutanova K . BERT: Pretraining of deep bidirectional transformers for language understanding . arXiv:1810.04805 [preprint] https://arxiv.org/abs/1810.04805. Posted October 11, 2018. Accessed June 7, 2022 .
    1. Liu Y , Ott M , Goyal N , et al. . RoBERTa: A robustly optimized BERT pretraining approach . arXiv:1907.11692 [preprint] https://arxiv.org/abs/1907.11692. Posted July 26, 2019. Accessed June 7, 2022 .
    1. Mikolov T , Chen K , Corrado G , Dean J . Efficient estimation of word representations in vector space . arXiv:1301.3781 [preprint] https://arxiv.org/abs/1301.3781. Posted January 16, 2013. Accessed June 7, 2022 .
    1. Lee J , Yoon W , Kim S , et al. . BioBERT: a pre-trained biomedical language representation model for biomedical text mining . Bioinformatics 2020. ; 36 ( 4 ): 1234 – 1240 . - PMC - PubMed