Chem Cent J. 2009 Feb 5:3:4.
doi: 10.1186/1752-153X-3-4.

Automated extraction of chemical structure information from digital raster images


Jungkap Park et al. Chem Cent J.

Abstract

Background: To search for chemical structures in research articles, diagrams or text representing molecules must be translated into a standard chemical file format compatible with cheminformatic search engines. However, chemical information in research articles is often presented as analog diagrams of chemical structures embedded in digital raster images. Several software systems have been developed to automate the analog-to-digital conversion of these diagrams, but their algorithmic performance and utility in cheminformatic research have not been investigated.

Results: This paper critically reviews these systems and reports our recent development of ChemReader, a fully automated tool for extracting chemical structure diagrams from research articles and converting them into standard, searchable chemical file formats. The basic algorithms for recognizing the lines and letters that represent bonds and atoms in chemical structure diagrams can be run independently, in sequence, from a graphical user interface, and the algorithm parameters can be readily changed, which facilitates further development tailored to a specific chemical database annotation scheme. Compared with existing software programs such as OSRA, Kekule, and CLiDE, ChemReader outperformed the other systems on several sets of sample images from diverse sources, in terms of both the rate of correct outputs and the accuracy of extracted molecular substructure patterns.

Conclusion: The availability of ChemReader as a cheminformatic tool for extracting chemical structure information from digital raster images allows research and development groups to enrich their chemical structure databases by annotating entries with published research articles. Given its stable performance and high accuracy, ChemReader may be sufficiently reliable for annotating chemical databases with links to scientific research articles.
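
The accuracy comparison is reported partly as average Tanimoto similarity scores (see the caption of Figure 13). As a rough illustration of that metric only, the sketch below computes the Tanimoto coefficient between two molecular fingerprints represented as sets of "on" bit positions. The function name `tanimoto` and the example bit sets are hypothetical; the abstract does not say which fingerprint type was used.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    if not a and not b:
        # Convention assumed here: two empty fingerprints count as identical.
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Hypothetical fingerprints: a reference structure and an extracted structure.
reference = {1, 4, 7, 9}
extracted = {1, 4, 9, 12}

score = tanimoto(reference, extracted)  # 3 shared bits out of 5 distinct bits
print(f"Tanimoto similarity: {score:.2f}")
```

A score of 1 means the two fingerprints are identical and 0 means they share no bits, so averaging the score over all outputs gives partial credit to structures that are nearly, but not exactly, correct.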

Figures

Figure 1
Automated extraction of chemical structure information in scientific articles.
Figure 2
Recognition of chemical structure diagram images in ChemReader. (a) input image, (b) character-line separation, (c) bond recognition, (d) character recognition, and (e) topology construction and data output.
Figure 3
Graphical User Interface (GUI) of ChemReader.
Figure 4
Hough Transformation for bond detection. (a) Cartesian Image Space, (b) Polar Hough Space, (c) Example of HT applied to a chemical structure image, and (d) Hough Space corresponding to (c).
Figure 5
Considering line thickness and connectivity in the HT. Vertical lines could have priority over horizontal lines in a modified HT due to their connectivity (a) and thickness (b).
Figure 6
Recovery of characters glued to graphics. (a) A sub-image in which a character component is connected to a graphic component, (b) line detection result for (a), and (c) correctly separated characters.
Figure 7
Detection of stereochemical wedge bonds. (a) Detected corner points (red blocks) around a wedge bond, (b) a combination of 3 corner points, where NBi = number of black pixels in each region, and (c) a wrongly detected wedge.
Figure 8
Sequential steps for bond detection. (a) Original image, (b) detected corner points after removing character components, (c) detected normal and wedge bonds, and (d) pixels left over before dashed bond detection.
Figure 9
Leftover pixels before hatched bond detection. (a) Original image, (b) pixels left over before hatched bond detection, and (c) the most-voted line from the HT and the line segments orthogonal to it.
Figure 10
Detection of aromatic ring bonds. (a) Chemical structure of naphthalene, (b) connected components, and the distribution of distances of pixels from the component's center for (c) circle bonding and (d) non-circle bonding.
Figure 11
Common character recognition errors. (a) low resolution, (b) broken character, (c) glued to a graphic component, and (d) glued characters.
Figure 12
Topology construction procedure. (a) detected bonds (lines) and symbols (rectangle), (b) created nodes (bold dots), and (c) final nodes and edges.
Figure 13
Percent of correct outputs and Average Tanimoto similarity scores over total outputs.
Figure 14
Output examples. (a) input images, and results by (b) ChemReader, (c) OSRA, (d) CLiDE, and (e) Kekule.
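
The captions of Figures 4 and 5 refer to a Hough Transformation (HT) that maps pixels from Cartesian image space into a polar Hough space for bond detection. The sketch below shows only the textbook form of that voting scheme, under the assumption that foreground pixels are given as (x, y) pairs. It does not include the line thickness and connectivity weighting that ChemReader's modified HT uses, and the names `hough_votes`, `n_theta`, and `rho_step` are hypothetical.

```python
import math
from collections import Counter


def hough_votes(pixels, n_theta=180, rho_step=1.0):
    """Accumulate votes in polar Hough space for a list of (x, y) pixels.

    Each pixel votes for every line rho = x*cos(theta) + y*sin(theta)
    that passes through it; theta is sampled at n_theta angles in [0, pi).
    """
    votes = Counter()
    for x, y in pixels:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            # Quantize rho so nearly collinear pixels share one accumulator cell.
            votes[(t, round(rho / rho_step))] += 1
    return votes


# Toy input: a vertical stroke at x = 5 plus one isolated noise pixel.
pixels = [(5, y) for y in range(10)] + [(0, 0)]
votes = hough_votes(pixels)

# The cell (theta index 0, rho bin 5) describes the line x = 5.
print(votes[(0, 5)])  # all 10 stroke pixels vote for this cell
```

Cells with many votes correspond to candidate bond lines. On a short stroke like this one, neighbouring angle cells collect similar vote counts, which is one reason a practical detector needs extra criteria to rank competing lines.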