Harvesting Knowledge from Web Data and Text

Abstract : The Web bears the potential of being the world's greatest encyclopedic source, but we are far from fully exploiting this potential. Valuable scientific and cultural content is interspersed with a huge amount of noisy, low-quality, unstructured text and media. The proliferation of knowledge-sharing communities like Wikipedia and the advances in automated information extraction from Web pages give rise to an unprecedented opportunity: Can we systematically harvest facts from the Web and compile them into a comprehensive machine-readable knowledge base? Such a knowledge base would contain not only the world's entities, but also their semantic properties, and their relationships with each other. Imagine a “Structured Wikipedia” that has the same scale and richness as Wikipedia itself, but offers a precise and concise representation of knowledge, e.g., in the RDF format. This would enable expressive and highly precise querying, e.g., in the SPARQL language (or appropriate extensions), with additional capabilities for informative ranking of query results.

The benefits from solving the above challenge would be enormous. Potential applications include
1) a formalized machine-readable encyclopedia that can be queried with high precision like a semantic database;
2) a key asset for disambiguating entities by supporting fast and accurate mappings of textual phrases onto named entities in the knowledge base;
3) an enabler for entity-relationship-oriented semantic search on the Web, for detecting entities and relations in Web pages and reasoning about them in expressive (probabilistic) logics;
4) a backbone for natural-language question answering that would aid in dealing with entities and their relationships in answering who/where/when/etc. questions;
5) a key asset for machine translation (e.g., English to German) and interpretation of spoken dialogs, where world knowledge provides essential context for disambiguation;
6) a catalyst for acquisition of further knowledge and largely automated maintenance and growth of the knowledge base.

While these application areas cover a broad, partly AI-flavored ground, the most notable one from a database perspective is semantic search: finally bringing DB methodology to Web search! For example, users (or tools on behalf of users) would be able to formulate queries about succulents that grow both in Africa and America, politicians who are also scientists or are married to singers, or flu medication that can be taken by people with high blood pressure. The search engine would return precise and concise answers: lists of entities or entity pairs (depending on the question structure), for example, Angela Merkel, Benjamin Franklin, etc., or Nicolas Sarkozy for the questions about scientists. This would be a quantum leap over today's search where answers are embedded if not buried in lots of result pages, and the human users would have to read them to extract entities and connect them to other entities. In this sense, the envisioned large-scale knowledge harvesting [42] from Web sources may also be viewed as machine reading [13].
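To make the kind of querying described in the abstract concrete, the following is a minimal sketch, not the authors' system: it stores a few subject-predicate-object facts in plain Python and answers conjunctive pattern queries, loosely in the spirit of a SPARQL basic graph pattern. The names `TRIPLES`, `match`, `query`, and the predicates `type` and `marriedTo` are hypothetical, and the facts are a hand-written toy set for illustration rather than harvested data.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Toy facts, hand-written for illustration only (assumption: not harvested data).
TRIPLES: List[Triple] = [
    ("Angela_Merkel", "type", "politician"),
    ("Angela_Merkel", "type", "scientist"),
    ("Benjamin_Franklin", "type", "politician"),
    ("Benjamin_Franklin", "type", "scientist"),
    ("Nicolas_Sarkozy", "type", "politician"),
    ("Nicolas_Sarkozy", "marriedTo", "Carla_Bruni"),
    ("Carla_Bruni", "type", "singer"),
]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Unify one pattern with one fact; terms starting with '?' are variables."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or reject if it is already bound to something else.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(patterns: List[Triple], triples: List[Triple],
          binding: Optional[Binding] = None) -> Iterator[Binding]:
    """Yield every variable binding that satisfies all patterns (a conjunction)."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = match(first, triple, binding)
        if extended is not None:
            yield from query(rest, triples, extended)


def politician_scientists(triples: List[Triple]) -> List[str]:
    """Entities that are both politicians and scientists."""
    patterns = [("?x", "type", "politician"), ("?x", "type", "scientist")]
    return sorted({b["?x"] for b in query(patterns, triples)})


def politicians_married_to_singers(triples: List[Triple]) -> List[Tuple[str, str]]:
    """Entity pairs (politician, spouse) where the spouse is a singer."""
    patterns = [
        ("?x", "type", "politician"),
        ("?x", "marriedTo", "?y"),
        ("?y", "type", "singer"),
    ]
    return sorted({(b["?x"], b["?y"]) for b in query(patterns, triples)})


if __name__ == "__main__":
    print(politician_scientists(TRIPLES))
    print(politicians_married_to_singers(TRIPLES))
```

The first query returns a list of entities and the second a list of entity pairs, mirroring the two answer shapes the abstract mentions. A real knowledge base would use an RDF store and a SPARQL engine with indexes and ranking; the nested scan here is only meant to show the shape of the query.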
Document type :
Conference papers
CIKM, 2010, Toronto, Canada. 2010

Cited literature [44 references]

Contributor : Fabian Suchanek
Submitted on : Wednesday, November 10, 2010 - 7:46:54 PM
Last modification on : Thursday, July 20, 2017 - 9:27:44 AM
Document(s) archived on : Friday, February 11, 2011 - 3:16:33 AM


Files produced by the author(s)


  • HAL Id : inria-00534905, version 1



Hady Lauw, Ralf Schenkel, Fabian Suchanek, Martin Theobald, Gerhard Weikum. Harvesting Knowledge from Web Data and Text. CIKM, 2010, Toronto, Canada. 2010. 〈inria-00534905〉


