J Digit Imaging. 2009 Aug;22(4):348-56.
doi: 10.1007/s10278-008-9110-7. Epub 2008 Apr 5.

Development of a Google-based search engine for data mining radiology reports

Joseph P Erinjeri et al. J Digit Imaging. 2009 Aug.

Abstract

The aim of this study is to develop a secure, Google-based data-mining tool for radiology reports using free and open source technologies and to explore its use within an academic radiology department. A Health Insurance Portability and Accountability Act (HIPAA)-compliant data repository, search engine and user interface were created to facilitate treatment, operations, and reviews preparatory to research. The Institutional Review Board waived review of the project, and informed consent was not required. Comprising 7.9 GB of disk space, 2.9 million text reports were downloaded from our radiology information system to a fileserver. Extensible markup language (XML) representations of the reports were indexed using Google Desktop Enterprise search engine software. A hypertext markup language (HTML) form allowed users to submit queries to Google Desktop, and Google's XML response was interpreted by a practical extraction and report language (PERL) script, presenting ranked results in a web browser window. The query, reason for search, results, and documents visited were logged to maintain HIPAA compliance. Indexing averaged approximately 25,000 reports per hour. Keyword search of a common term like "pneumothorax" yielded the first ten most relevant results of 705,550 total results in 1.36 s. Keyword search of a rare term like "hemangioendothelioma" yielded the first ten most relevant results of 167 total results in 0.23 s; retrieval of all 167 results took 0.26 s. Data mining tools for radiology reports will improve the productivity of academic radiologists in clinical, educational, research, and administrative tasks. By leveraging existing knowledge of Google's interface, radiologists can quickly perform useful searches.
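As a rough back-of-the-envelope check on the figures quoted in the abstract, the sketch below estimates total indexing time and mean report size. It assumes a constant indexing rate and decimal gigabytes, neither of which the abstract states; the function names are illustrative only.

```python
# Figures taken from the abstract.
REPORTS = 2_900_000          # text reports downloaded
REPORTS_PER_HOUR = 25_000    # approximate average indexing rate
DISK_BYTES = 7.9e9           # 7.9 GB, ASSUMING decimal gigabytes


def indexing_hours(n_reports: int, rate_per_hour: int) -> float:
    """Hours needed to index n_reports, ASSUMING a constant rate."""
    return n_reports / rate_per_hour


def mean_report_bytes(total_bytes: float, n_reports: int) -> float:
    """Average size of one report on disk."""
    return total_bytes / n_reports


hours = indexing_hours(REPORTS, REPORTS_PER_HOUR)
avg_bytes = mean_report_bytes(DISK_BYTES, REPORTS)
print(f"Indexing time: {hours:.0f} h (about {hours / 24:.1f} days)")
print(f"Mean report size: {avg_bytes:.0f} bytes")
```

Under these assumptions, a full index of the repository would take on the order of days, while individual reports are only a few kilobytes each.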

Figures

Fig 1
Radsearch schematic. A radiologist submits a query via an HTML form, which is interpreted by a PERL CGI script running on the web server. The CGI prepares an HTTP request to the search engine, and the search engine’s XML response is interpreted by the same script. The user information and search terms are logged, and the hits and snippets are returned to the radiologist’s web browser.
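The request flow in this schematic can be sketched roughly as follows. The original system used a PERL CGI script; this Python version is only an illustration. The search URL, the query parameter names, the XML element names (`results`, `result`, `url`, `snippet`), and the log format are all hypothetical stand-ins, not the actual Google Desktop interface or the authors' code.

```python
import io
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from urllib.parse import urlencode

# Hypothetical search endpoint; the real address is not given in the source.
SEARCH_BASE = "http://search.example.internal/search"


def build_request_url(terms: str, start: int = 0, num: int = 10) -> str:
    """Prepare the HTTP request the CGI would send to the search engine."""
    params = {"q": terms, "format": "xml", "start": start, "num": num}
    return SEARCH_BASE + "?" + urlencode(params)


def parse_response(xml_text: str):
    """Extract the total hit count and (url, snippet) pairs from the XML reply.

    The element and attribute names are assumptions made for this sketch.
    """
    root = ET.fromstring(xml_text)
    total = int(root.get("count", "0"))
    hits = [
        {"url": r.findtext("url", default=""),
         "snippet": r.findtext("snippet", default="")}
        for r in root.iter("result")
    ]
    return total, hits


def log_search(log, user: str, reason: str, terms: str, total: int) -> None:
    """Append one audit record (who, why, what, how many) as a JSON line."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "reason": reason,
        "terms": terms,
        "total_results": total,
    }
    log.write(json.dumps(entry) + "\n")


# Made-up response standing in for the search engine's XML output.
SAMPLE_XML = """<results count="167">
  <result><url>file:///reports/0001.xml</url><snippet>... hemangioendothelioma ...</snippet></result>
  <result><url>file:///reports/0002.xml</url><snippet>... epithelioid hemangioendothelioma ...</snippet></result>
</results>"""

audit_log = io.StringIO()  # stands in for a log file
request_url = build_request_url("hemangioendothelioma")
total, hits = parse_response(SAMPLE_XML)
log_search(audit_log, "radiologist1", "teaching file", "hemangioendothelioma", total)
```

Keeping request building, response parsing, and logging in separate functions mirrors the three roles the caption assigns to the single CGI script.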
Fig 2
Text and XML radiology reports. Text radiology reports (a) are converted to XML representations (b) prior to indexing by the search engine. By placing field tags (e.g., “<exam>”) adjacent to terms (e.g., “chest”) within the XML documents, text within different fields can be identified independently. A query of “<exam> chest” would yield documents where the examination was a chest X-ray, whereas a query of “chest” would identify documents where the word chest appeared anywhere in the report (e.g., “chest pain”). To maintain patient confidentiality within this figure, PHI has been anonymized (shaded in gray).
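A minimal sketch of the text-to-XML conversion and of the difference between a field-restricted and an unrestricted query is given below. The report layout (`Exam:`, `History:` headers), the field names, and the matching logic are assumptions for illustration; the paper's actual converter and the search engine's tokenization are not described in this excerpt.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names; the real report sections may differ.
FIELDS = ("exam", "history", "findings", "impression")


def parse_text_report(text: str) -> dict:
    """Split a plain-text report into fields, ASSUMING 'Name: value' headers."""
    fields = {}
    current = None
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        key = head.strip().lower()
        if sep and key in FIELDS:
            current = key
            fields[current] = rest.strip()
        elif current and line.strip():
            # Continuation line: append to the field currently being read.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


def to_xml(fields: dict) -> str:
    """Build an XML representation with one element per report field."""
    root = ET.Element("report")
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def matches(fields: dict, term: str, field=None) -> bool:
    """Whole-word match, optionally restricted to one field.

    Uses naive whitespace tokenization, so punctuation stays attached to words.
    """
    term = term.lower()
    if field is not None:
        return term in fields.get(field, "").lower().split()
    return any(term in value.lower().split() for value in fields.values())


# Made-up report containing no patient information.
SAMPLE = """Exam: Abdomen CT
History: chest pain radiating to abdomen
Findings: No acute abnormality."""

report = parse_text_report(SAMPLE)
report_xml = to_xml(report)
anywhere = matches(report, "chest")                 # word appears in History
in_exam = matches(report, "chest", field="exam")    # exam is not a chest study
```

In this example the unrestricted query finds the report because of "chest pain" in the history, while the exam-restricted query does not, which is the distinction the caption draws.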
Fig 3
Radsearch user interface. (a) Search form. To perform a search, a radiologist must fill in the search terms, username, password, and reason for search. Users can specify which results to show (e.g., patient records, contact info, presentations) as well as output format (snippets or list). (b) Results display. The number of results, duration of search, links to matching radiology reports, and snippets are displayed for each search. Additional links allow for highlighting, anonymization, and display of XML documents. To maintain patient confidentiality within this figure, PHI has been anonymized (shaded in gray).
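Two of the interface behaviors described in this caption, required form fields and highlighting of search terms in snippets, could be sketched as below. The form field names and the `<b>` highlighting markup are hypothetical; the source does not describe how the original script implements either feature.

```python
import html
import re

# Hypothetical form field names for the four required inputs.
REQUIRED = ("terms", "username", "password", "reason")


def missing_fields(form: dict) -> list:
    """Return the required fields that are absent or blank in a submitted form."""
    return [name for name in REQUIRED if not form.get(name, "").strip()]


def highlight(snippet: str, terms: str) -> str:
    """Wrap whole-word occurrences of each search term in <b> tags.

    The snippet is split on matches first and each piece is HTML-escaped
    afterwards, so report text cannot inject markup into the results page.
    """
    words = [re.escape(t) for t in terms.split()]
    if not words:
        return html.escape(snippet)
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    # With one capture group, re.split alternates non-match and match pieces.
    parts = pattern.split(snippet)
    return "".join(
        f"<b>{html.escape(part)}</b>" if i % 2 else html.escape(part)
        for i, part in enumerate(parts)
    )


form = {"terms": "pneumothorax", "username": "radiologist1", "password": "secret"}
problems = missing_fields(form)  # the reason for search was left out
marked = highlight("Small right pneumothorax.", form["terms"])
```

Rejecting a submission when `missing_fields` is non-empty would ensure that every logged search carries a user and a stated reason, in line with the audit requirement mentioned in the abstract.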
