Environ Health Perspect. 2024 Feb;132(2):27006. doi: 10.1289/EHP13215. Epub 2024 Feb 13.

Standardizing Extracted Data Using Automated Application of Controlled Vocabularies

Caroline Foster et al.

Abstract

Background: Extraction of toxicological end points from primary sources is a central component of systematic reviews and human health risk assessments. To ensure optimal use of these data, consistent language should be used for end point descriptions. However, the primary source language describing treatment-related end points can vary greatly, so substantial manual effort is required to standardize extractions before the data are fit for use.

Objectives: To minimize this manual effort, we applied an augmented intelligence approach and developed automated tools to support standardization of extracted information via application of preexisting controlled vocabularies.

Methods: We created a harmonized controlled vocabulary crosswalk consisting of Unified Medical Language System (UMLS) codes, German Federal Institute for Risk Assessment (BfR) DevTox harmonized terms, and Organization for Economic Co-operation and Development (OECD) end point vocabularies. We then applied it to roughly 34,000 extractions from prenatal developmental toxicology studies conducted by the National Toxicology Program (NTP) and 6,400 extractions from European Chemicals Agency (ECHA) prenatal developmental toxicology studies, all recorded in the original study report language.

Results: We automatically applied standardized controlled vocabulary terms to 75% of the NTP extracted end points and 57% of the ECHA extracted end points. Of all the standardized extracted end points, about half (51%) required manual review for potential extraneous matches or inaccuracies. Extracted end points that were not mapped to standardized terms tended to be too general or to require human logic to find a good match. We estimate that this augmented intelligence approach saved >350 hours of manual effort and yielded valuable resources, including a controlled vocabulary crosswalk, organized related-terms lists, code for implementing an automated mapping workflow, and a computationally accessible dataset.

Discussion: Augmenting manual efforts with automation tools increased the efficiency of producing a findable, accessible, interoperable, and reusable (FAIR) dataset of regulatory guideline studies. This open-source approach can be readily applied to other legacy developmental toxicology datasets, and the code design is customizable for other study types. https://doi.org/10.1289/EHP13215.


Figures

Figure 1 is a flowchart with two mapping examples. Example 1 (one-to-many): UMLS codes C0000768, C0018798, and C0456389 (congenital abnormality, heart, size) map to two BfR DevTox terms, 3.1047.5088 (visceral, heart, large) and 3.1047.5211 (visceral, heart, small), and to OECD term O74.186.95 (fetuses, fetal abnormalities, visceral or soft tissue, cardiovascular). Example 2 (many-to-one): two UMLS term sets, C0000768, C0015392, and C0332197 (congenital abnormality, eye, absent) and C0000768, C0015392, and C0332197 (congenital abnormality, eye, agenesis), both map to BfR DevTox term 3.1032.5002 (visceral, eye, absent) and to OECD term O74.186.66 (fetuses, fetal abnormalities, external eye).
Figure 1.
Conceptualization of one-to-many and many-to-one crosswalk mappings. The top half of Figure 1 illustrates a one-to-many match, in which a less specific Unified Medical Language System (UMLS) term mapped to multiple, more specific German Federal Institute for Risk Assessment (BfR) DevTox terms; both BfR DevTox terms are equally relevant to the UMLS term. The bottom half illustrates a many-to-one match, in which multiple UMLS terms map to the same BfR DevTox term and to the same Organization for Economic Co-operation and Development (OECD) term.
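To make the two mapping shapes concrete, here is a minimal Python sketch representing them as plain dictionaries, using the example codes from Figure 1. The data layout, names, and string formats are our own illustration, not the published crosswalk format.

```python
# Minimal sketch of the two mapping shapes in Figure 1 (illustrative layout,
# not the published crosswalk format).

# One-to-many: a less specific UMLS term set maps to several more specific
# BfR DevTox terms, all equally relevant.
one_to_many = {
    "C0000768+C0018798+C0456389 (congenital abnormality, heart, size)": [
        "3.1047.5088 (visceral, heart, large)",
        "3.1047.5211 (visceral, heart, small)",
    ],
}

# Many-to-one: distinct UMLS term sets collapse onto the same BfR DevTox term.
many_to_one = {
    "congenital abnormality, eye, absent": "3.1032.5002 (visceral, eye, absent)",
    "congenital abnormality, eye, agenesis": "3.1032.5002 (visceral, eye, absent)",
}
```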
Figure 2 is a table with three columns: UMLS vocabulary, BfR DevTox vocabulary, and OECD vocabulary; the UMLS column is linked to both the BfR DevTox and OECD columns. All four rows contain the same UMLS term set, C0000768, C0015392, and C0332197 (congenital abnormality, eye, agenesis), paired with each combination of BfR DevTox and OECD terms. Row 1: 3.1032.5211 (visceral, eye, small) and O74.186.66 (fetuses, fetal abnormalities, external eye). Row 2: 3.1032.5002 (visceral, eye, absent) and O74.186.66 (fetuses, fetal abnormalities, external eye). Row 3: 3.1032.5211 (visceral, eye, small) and O74.186.108 (fetuses, fetal abnormalities, visceral or soft tissue eye). Row 4: 3.1032.5002 (visceral, eye, absent) and O74.186.108 (fetuses, fetal abnormalities, visceral or soft tissue eye).
Figure 2.
Example section of the completed crosswalk and the first phase of mapping. This figure illustrates the first mapping phase, in which we developed a crosswalk (an annotation of the overlap between controlled vocabularies) between the three vocabularies, and shows both a complex and a simple mapping. All four rows contain the same Unified Medical Language System (UMLS) term because the mapping for that term was a one-to-many match for both the German Federal Institute for Risk Assessment (BfR) DevTox and Organization for Economic Co-operation and Development (OECD) vocabularies. Combining the mappings into one crosswalk yields four rows, each with a different combination of BfR DevTox and OECD terms paired to the same UMLS term. The BfR DevTox and OECD terms are not paired to one another but only to the UMLS term, so the crosswalk should be read left to right, from UMLS to BfR DevTox and from UMLS to OECD separately, as indicated by the arrows. This design ensures that the annotation code the crosswalk was designed for can find the best matches from each vocabulary for a given extracted end point. Details on how the code accomplishes this are presented in section "Annotation Automation."
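The four rows of Figure 2 fall out of taking every combination of the UMLS term's BfR DevTox matches with its OECD matches. A sketch of that expansion, under the same illustrative string format as above (our reconstruction, not the published build code):

```python
from itertools import product

# The single UMLS term set shared by all four rows of Figure 2.
umls = "C0000768+C0015392+C0332197 (congenital abnormality, eye, agenesis)"

bfr_matches = ["3.1032.5211 (visceral, eye, small)",
               "3.1032.5002 (visceral, eye, absent)"]
oecd_matches = ["O74.186.66 (external eye)",
                "O74.186.108 (visceral or soft tissue eye)"]

# Each BfR/OECD combination becomes its own crosswalk row; BfR and OECD
# terms are each paired to the UMLS term, never to one another.
crosswalk_rows = [(umls, bfr, oecd)
                  for bfr, oecd in product(bfr_matches, oecd_matches)]
assert len(crosswalk_rows) == 4  # the four rows shown in Figure 2
```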
Figure 3 is a set of four tables showing excerpts from the User-Defined Look-Up Lists (the asterisk is a wildcard). Localizations (columns: localization, synonym): eye; cranium, cranial; digit*, phalange*; lens, eye; retina, eye. Observations (columns: observation, synonym): absence, missing; absence, agenesis; duplicat*, double; large, big; reduced number, fewer. Unique words (columns: unique word, synonym): non-live, dead fetuses; salivation, drooling; premature birth, delivered early; dams died, mortality; dead or removed, mortality. Combo words (columns: combo word, localization, observation): acephaly, head, absent; adactyly, digit, absent; anophthalmia, eye, absent; anophthalmos, eye, absent; anury, tail, absence.
Figure 3.
Example excerpts from the four User-Defined Look-Up Lists: Localization words, Observation words, Combination words, and Unique words. These lists serve as input files to the annotation code, which uses them to link the extracted end points to the controlled vocabulary terms in the controlled vocabulary crosswalk. The matching relies on Boolean logic and wildcard searching (an asterisk matches any word that begins with the letters preceding it).
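As an illustration of the wildcard convention, the following Python sketch shows how a look-up entry such as digit* could be matched against end point text. The fnmatch-based helper and the tiny localization list are our assumptions, not the published implementation.

```python
from fnmatch import fnmatch

# Hypothetical excerpt from the Localization look-up list: a trailing
# asterisk matches any word beginning with the preceding letters, so
# "digit*" matches "digit", "digits", "digital", etc.
localizations = {"eye": "eye", "digit*": "digit", "phalange*": "digit"}

def find_localizations(endpoint_text: str) -> set[str]:
    """Return the canonical localization words found in an end point."""
    words = endpoint_text.lower().split()
    return {canonical
            for pattern, canonical in localizations.items()
            for word in words
            if fnmatch(word, pattern)}

print(find_localizations("fetuses with missing digits"))  # {'digit'}
```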
Figure 4 is a flowchart with three steps. Step 1, primary source extracted end points; example: "fetuses with small eyes." Step 2, User-Defined Look-Up Lists (localizations, observations, combo words, unique words); example: localizations "eye, eyes"; observations "small, agenesis"; combo words "microphthalmos." Step 3, controlled vocabulary matches; example UMLS CUI sets: C0000768 (congenital abnormality) + C0015392 (eye) + C0000846 (agenesis); C0015392 (eye) + C4086369 (gross pathology result) + C0392756 (reduced) + C0456389 (size); and C0000768 (congenital abnormality) + C0015392 (eye) + C0023317 (lens, crystalline) + C0700321 (small); example BfR DevTox terms: 3.1032.5211 (visceral, eye, small) and 3.1161.5211 (visceral, lens, small); example OECD 74 term: O74.186.66 (fetuses, fetal abnormalities, external, eye). The UMLS, BfR DevTox, and OECD 74 vocabularies are interconnected. The legend has six parts: cross-walked, apply x to y, search for word matches (exact matches or synonyms), controlled vocabularies, primary source extractions, and User-Defined Look-Up Lists.
Figure 4.
Illustration of the annotation code logic and the second phase of mapping. This figure shows a simplified version of the three-step approach the code takes to match extracted end points to terms from each of the three controlled vocabularies. The code completes step 3 three times, each time changing which controlled vocabulary it searches for matches, and then uses the crosswalk to pull in the terms from the other two controlled vocabularies that are linked to any matches found. If a UMLS term is not found during this process, the code pulls in BfR terms based on an OECD match, or an OECD term based on a BfR match. At the conclusion of the process, an end point may have more than a single match (e.g., a UMLS term can have many matches in OECD or BfR). Note: BfR, German Federal Institute for Risk Assessment; OECD, Organization for Economic Co-operation and Development; UMLS, Unified Medical Language System.
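A compressed sketch of that three-pass logic is below. All data structures and names are our reconstruction of the described behavior, not the published code: each vocabulary index maps look-up words to a term, and the crosswalk maps a term to its linked terms in the other vocabularies.

```python
# Our reconstruction of the three-pass matching logic described for
# Figure 4; data layout and names are illustrative, not the published code.

def annotate(endpoint_text: str,
             vocab_indexes: dict[str, dict[str, str]],
             crosswalk: dict[str, dict[str, set[str]]]) -> dict[str, set[str]]:
    words = set(endpoint_text.lower().split())
    # Step 3, run once per controlled vocabulary (UMLS, BfR DevTox, OECD):
    # collect terms whose look-up words appear in the end point text.
    matches = {vocab: {term for word, term in index.items() if word in words}
               for vocab, index in vocab_indexes.items()}
    # Fill gaps through the crosswalk: e.g., pull in BfR terms based on an
    # OECD match (or vice versa) when no direct match was found.
    for target in matches:
        if not matches[target]:
            pulled: set[str] = set()
            for found in matches.values():
                for term in found:
                    pulled |= crosswalk.get(term, {}).get(target, set())
            matches[target] = pulled
    return matches  # an end point may keep multiple matches per vocabulary
```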
Figure 5 is a flowchart with three decision steps. Step 1: Did the extracted end point receive a UMLS term via automation? If no, manually map it to UMLS; if yes, continue. Step 2: Did the extracted end point receive BfR DevTox and OECD terms? If yes, terms from the three controlled vocabularies are applied to all rows where possible; if no, continue. Step 3: Are the terms crosswalk compatible? If yes, apply the crosswalk; if no, manually map BfR DevTox or OECD terms.
Figure 5.
Manual mapping workflow for extracted end points not standardized to controlled vocabulary terms by the automated annotation code. Crosswalk compatible is defined as all three conditions being met: a) the end point did not receive a Unified Medical Language System (UMLS) term from the code, b) only one UMLS term is applied, and c) the applied UMLS term crosswalks to no more than one German Federal Institute for Risk Assessment (BfR) DevTox term and one Organization for Economic Co-operation and Development (OECD) term.
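The three-part compatibility test translates directly into a small predicate. A sketch under assumed field names (one record per extracted end point; none of these names come from the published code):

```python
def is_crosswalk_compatible(endpoint: dict, crosswalk: dict) -> bool:
    """Apply the three conditions from the Figure 5 caption
    (field names here are illustrative, not from the published code)."""
    umls_terms = endpoint.get("umls_terms", [])
    if endpoint.get("umls_from_code", False):  # (a) UMLS term must not come from the code
        return False
    if len(umls_terms) != 1:                   # (b) exactly one UMLS term applied
        return False
    mapped = crosswalk.get(umls_terms[0], {})
    return (len(mapped.get("BfR DevTox", [])) <= 1   # (c) crosswalks to at most one
            and len(mapped.get("OECD", [])) <= 1)    #     BfR DevTox and one OECD term
```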
Figure 6 is a set of two linked tables.
Top table (outcome of automated term application), columns: category; mapped by automation; manually mapped to general or reproductive terms; manual decision-making required to find accurate matches.
Overall: 72.3% (28,392/39,262); 7.8% (3,078/39,262); 19.8% (7,792/39,262).
NTP dataset: 75.0% (25,023/33,365); 5.2% (1,748/33,365); 19.8% (6,594/33,365).
ECHA dataset: 57.1% (3,369/5,897); 22.6% (1,330/5,897); 20.3% (1,198/5,897).
Bottom table (manual effort required), columns: category; automated mapping without manual review required; automated mapping with manual review required; no automated mapping, manual mapping required.
Overall: 35.3% (13,865/39,262); 37.0% (14,527/39,262); 27.7% (10,870/39,262).
NTP dataset: 34.5% (11,501/33,365); 40.5% (13,522/33,365); 25.0% (8,342/33,365).
ECHA dataset: 40.1% (2,364/5,897); 17.0% (1,005/5,897); 42.9% (2,528/5,897).
Figure 6.
Results of automated controlled vocabulary term application (annotation code) and impact on manual effort requirements. The top table in Figure 6 provides the breakdown of extracted end points automatically mapped to controlled vocabulary terms by the code (green) vs. extracted end points that were not automatically mapped and instead required manual mapping (blue). These end points required manual mapping because they did not have precise controlled vocabulary term matches and could either be mapped only to general and reproductive terms (8%) or be mapped to more specific terms only by using human logic (20%). The bottom table in Figure 6 illustrates the breakdown of manual effort required when using the automated annotation code. Manual effort was eliminated for the 35% of extractions that were automatically mapped via the annotation code and did not require manual review, halved for the 37% that were automatically mapped but did require manual review, and unaffected for the 28% that required manual mapping.
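As a rough back-of-envelope check (our arithmetic, not the authors'): if a manual review takes about half the time of a full manual mapping, the bottom table implies the expected per-end-point effort fell to roughly 0.353 × 0 + 0.370 × 0.5 + 0.277 × 1 ≈ 0.46 of a fully manual baseline, i.e., an overall reduction of about 54%, directionally consistent with the >350 hours of savings estimated in the abstract.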
