I report here some highlights of the keynote speech by Xin Luna Dong at the 16th International Conference on Web Engineering (ICWE 2016). Incidentally, she is now moving to Amazon to start a new project on building an Amazon knowledge base.
Building knowledge bases remains a challenging task.
First, one has to decide how to build the knowledge: automatically or manually?
A survey in 2014 reported the following list of large efforts in knowledge building: the top 4 approaches are manually curated, the bottom 3 are automatic.
Google's Knowledge Vault and Knowledge Graph are the big winners in terms of volume.
When you move to long tail content, curation does not scale. Automation must be viable and precise.
This is in line with the research line we are starting on Extracting Changing Knowledge (we presented a short paper at a Web Science 2016 workshop last month). Here is a summary of our approach:
On the Quest for Changing Knowledge. Capturing emerging entities from social media. WebScience 2016 DDI
Where can knowledge be extracted from? In Knowledge Vault:
- the largest share of content comes from DOM-structured documents
- then textual content
- then annotated content
- and a small share from web tables
Knowledge Vault is a matrix-based approach to knowledge building, with rows = entities and columns = attributes.
It assumes the entities are already available (e.g., in Freebase) and builds a training set over them.
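The matrix view can be sketched as a sparse entity-by-attribute structure; a minimal illustration (the entity and attribute names below are invented for the example, not taken from the talk):

```python
# Sketch of the matrix view of KB construction:
# rows are entities (assumed known, e.g. from Freebase),
# columns are attributes/predicates, cells hold extracted values.

kb = {}  # entity -> {attribute -> value}, i.e. a sparse matrix

def add_fact(entity, attribute, value):
    """Fill one cell of the entity/attribute matrix."""
    kb.setdefault(entity, {})[attribute] = value

add_fact("Barack Obama", "place_of_birth", "Honolulu")
add_fact("Barack Obama", "profession", "Politician")

print(kb["Barack Obama"]["place_of_birth"])  # Honolulu
```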
One can build KBs by grouping triples into buckets with similar probability of being correct. It is important to precisely estimate the correctness probability.
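A hedged sketch of what such bucketing might look like, with made-up triples and probabilities (the real system's calibration is far more involved):

```python
# Group scored triples into buckets of similar estimated
# correctness probability (here: deciles), so each bucket can be
# validated or released with a known confidence level.
# Triples and probabilities below are invented for illustration.

triples = [
    (("Obama", "born_in", "Honolulu"), 0.97),
    (("Obama", "born_in", "Kenya"), 0.05),
    (("Paris", "capital_of", "France"), 0.99),
    (("Paris", "capital_of", "Texas"), 0.42),
]

def bucket_by_probability(scored_triples, n_buckets=10):
    buckets = {i: [] for i in range(n_buckets)}
    for triple, p in scored_triples:
        # map p in [0, 1] to a bucket index 0..n_buckets-1
        idx = min(int(p * n_buckets), n_buckets - 1)
        buckets[idx].append(triple)
    return buckets

buckets = bucket_by_probability(triples)
print(buckets[9])  # triples with estimated probability >= 0.9
```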
Errors can include mistakes on:
- triple identification
- entity linkage
- predicate linkage
- source data
When extracting knowledge, the ingredients are: the data sources, the extraction approach, the data items themselves, and the facts with their probability of truth.
Several models can be used for extracting knowledge. Two extremes of the spectrum are:
- Single-truth model. Every fact has only one true value. We trust the value asserted by the highest number of data sources.
- Multi-layer model. It separates source quality from extractor quality, and data errors from extraction errors. One can build a knowledge-based trust model defining the trustworthiness of web pages, and compare this measure with the PageRank of the same pages.
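The contrast between the two extremes can be sketched in a few lines. This is a toy illustration, not the actual models from the talk: the sources, claimed values, and quality scores are invented, and the "multi-layer" side is reduced to a simple quality-weighted vote.

```python
# Two toy truth-inference strategies over conflicting claims.
from collections import Counter

claims = {  # source -> value claimed for one fact (invented data)
    "site-a.com": "Honolulu",
    "site-b.com": "Honolulu",
    "site-c.com": "Kenya",
}

# Single-truth model: trust the value backed by most sources.
def majority_vote(claims):
    return Counter(claims.values()).most_common(1)[0][0]

# One step toward the multi-layer idea: weight each vote by a
# separately estimated quality score of its source.
source_quality = {"site-a.com": 0.9, "site-b.com": 0.6, "site-c.com": 0.3}

def weighted_vote(claims, quality):
    scores = {}
    for source, value in claims.items():
        scores[value] = scores.get(value, 0.0) + quality[source]
    return max(scores, key=scores.get)

print(majority_vote(claims))                  # Honolulu
print(weighted_vote(claims, source_quality))  # Honolulu
```

Note how the weighted variant can disagree with the majority when a few high-quality sources outvote many low-quality ones; separating source quality from extractor quality refines exactly this kind of weighting.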
Overall, a lot of ingredients influence the correctness of knowledge: temporal aspects, data source correctness, capability of extraction and validation, and so on.
In summary: plenty of research challenges to be addressed, both by the data science and modeling communities!
To keep updated on my activities you can subscribe to the RSS feed of my blog or follow my twitter account (@MarcoBrambi).