SIGMOD Rec. 2016 Mar;45(1):60-67. Epub 2016 Feb 6.

DeepDive: Declarative Knowledge Base Construction

Christopher De Sa et al. SIGMOD Rec. 2016 Mar.

Abstract

The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and PDF reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.
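To illustrate the declarative style the abstract describes, here is a minimal Python sketch of the developer's side of a KBC program: only candidate generation and feature rules are written by hand, and no inference code appears. The spouse-relation example, the field names, and the helper functions are all hypothetical stand-ins, not DeepDive's actual DDlog syntax.

```python
# Hypothetical sketch of a DeepDive-style KBC program in plain Python.
# The developer supplies candidate generators and feature rules; the
# system (not shown) handles learning and probabilistic inference.

def spouse_candidates(sentence):
    """Generate candidate (person, person) mention pairs from one sentence."""
    people = [tok for tok in sentence["tokens"] if tok["ner"] == "PERSON"]
    return [(a["word"], b["word"])
            for i, a in enumerate(people)
            for b in people[i + 1:]]

def features(sentence, pair):
    """Declarative feature rules about the domain; no inference logic here."""
    words = [tok["word"] for tok in sentence["tokens"]]
    feats = []
    if "married" in words:
        feats.append("HAS_MARRIED_KEYWORD")
    return feats

# Toy preprocessed sentence (invented schema for illustration).
sent = {"tokens": [{"word": "Barack", "ner": "PERSON"},
                   {"word": "married", "ner": "O"},
                   {"word": "Michelle", "ner": "PERSON"}]}

for pair in spouse_candidates(sent):
    print(pair, features(sent, pair))
```

In a real DeepDive program these rules would be written against SQL tables of mentions and features, and the system would turn them into a factor graph for joint inference.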


Figures

Figure 1
Knowledge Base Construction (KBC) is the process of populating a structured relational knowledge base from unstructured sources. DeepDive is a system aimed at facilitating the KBC process by allowing domain experts to integrate their domain knowledge without worrying about algorithms.
Figure 2
Example KBC Application Built with DeepDive.
Figure 3
Quality of KBC systems built with DeepDive. On many applications, KBC systems built with DeepDive achieve quality comparable to (and sometimes better than) that of professional human volunteers, and lead to similar scientific insights on topics such as biodiversity. This quality is achieved by iteratively integrating diverse sources of data; quality often scales with the amount of information entered into the system.
Figure 4
One challenge of building high-quality KBC systems is dealing with diverse sources jointly to make predictions. In this example page of a paleontology journal article, information extracted from tables, text, and external structured knowledge bases must all be combined to reach the final extraction. The problem becomes even more challenging because many extractors are not 100% accurate, motivating the joint probabilistic inference engine inside DeepDive.
Figure 5
Another challenge of building high-quality KBC systems is that one usually needs to deal with data at the scale of terabytes. These data are processed not only with traditional relational operations but also with operations involving machine learning and statistical inference. Thus, DeepDive includes a set of techniques to speed up and scale up inference tasks involving billions of correlated random variables.
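The inference task the caption refers to can be made concrete with a toy Gibbs sampler over a handful of correlated Boolean variables. The chain structure, the "agreement" factors, and the weights below are invented purely for illustration; DeepDive's actual sampler applies the same idea to billions of variables with far more engineering.

```python
import math
import random

# Toy Gibbs sampler over four Boolean variables connected by pairwise
# agreement factors exp(w * 1[x_i == x_j]). Structure and weights are
# invented for illustration only.

random.seed(0)
n = 4
edges = [(0, 1, 2.0), (1, 2, 2.0), (2, 3, 2.0)]  # (i, j, weight)
state = [random.choice([0, 1]) for _ in range(n)]

def prob_true(i):
    """P(x_i = 1 | all other variables) under the agreement factors."""
    s1 = sum(w for a, b, w in edges
             if (a == i and state[b] == 1) or (b == i and state[a] == 1))
    s0 = sum(w for a, b, w in edges
             if (a == i and state[b] == 0) or (b == i and state[a] == 0))
    return math.exp(s1) / (math.exp(s1) + math.exp(s0))

counts = [0] * n
sweeps = 1000
for _ in range(sweeps):
    for i in range(n):           # resample each variable given the rest
        state[i] = 1 if random.random() < prob_true(i) else 0
    for i in range(n):
        counts[i] += state[i]

marginals = [c / sweeps for c in counts]  # estimated P(x_i = 1)
```

Each sweep resamples every variable from its conditional distribution given its neighbors; averaging the samples yields marginal probabilities, which is the kind of per-variable output a KBC system reports as extraction confidence.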
Figure 6
A KBC system takes unstructured documents as input and outputs a structured knowledge base. The runtimes shown are for the TAC-KBP competition system. To improve quality, the developer adds new rules and new data, with error analysis conducted on the result of the current snapshot of the system. DeepDive provides a declarative language to specify the different types of rules and data, along with techniques to execute this iterative process incrementally.
Figure 7
An example KBC system. See Section 3.2 for details.
Figure 8
Schematic illustration of grounding. Each tuple corresponds to a Boolean random variable and node in the factor graph. We create one factor for every set of groundings.
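The grounding step in the caption can be sketched in a few lines of Python: every tuple becomes a Boolean random variable, and every grounding of a rule over those tuples becomes a factor linking the corresponding variables. The smoker/friend rule, relation names, and data below are invented for illustration and are not DeepDive's actual implementation.

```python
# Illustrative sketch of grounding a rule into a factor graph.
# Each tuple maps to a Boolean random variable; each grounding of the
# rule  Smokes(x) ^ Friend(x, y) => Smokes(y)  produces one factor
# over the three variables it mentions. (Toy example, assumed semantics.)

smokes = ["alice", "bob"]        # Smokes(x) tuples in the database
friends = [("alice", "bob")]     # Friend(x, y) tuples in the database

variables = {}                   # tuple -> variable id

def var(tup):
    """Return the variable id for a tuple, creating it on first use."""
    return variables.setdefault(tup, len(variables))

factors = []
for x, y in friends:             # enumerate groundings of the rule
    if x in smokes:
        factors.append((var(("Smokes", x)),
                        var(("Friend", x, y)),
                        var(("Smokes", y))))

print(len(variables), len(factors))
```

Running this creates three variables (two Smokes tuples and one Friend tuple) and a single factor tying them together; inference over the resulting graph then assigns each tuple a marginal probability of being true.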

