
Might we in the future have a single database storing all human knowledge?

Over recent years, knowledge bases such as Wikipedia and Allociné, and social networks such as Facebook, have multiplied on the Internet. These bases, built up by Internet users or created automatically by computers, are growing ever larger and contain more and more data. The problem is that their size makes them extremely difficult to handle, and even to study. Research on large-scale databases is therefore highly valuable at the present time.


Essentially, a database is a graph in which each node represents a concept and each link a specific relationship between two concepts. These links can therefore be of different natures. For example, social networks are databases in which the nodes are people and the links are the relationships that interconnect them. Today's databases, whether social networks or knowledge bases (Allociné, IMDb, ...), have millions of nodes and often a hundred or more possible types of relationship. Most bases are collaborative, i.e., they are augmented and improved by Internet users themselves, and may naturally contain errors or duplicate entries. Other bases are created by robots that collect their information from the Web, and consequently they too are error-prone.
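To make this concrete, here is a minimal sketch of how such a base can be stored as a graph of typed links, with each fact recorded as a (node, relation, node) triple. The entities and relation names below are invented for illustration and are not taken from any real base.

```python
# Minimal sketch: a knowledge base stored as (subject, relation, object) triples.
# All names below are illustrative placeholders.
from collections import defaultdict

triples = [
    ("Jean Dujardin", "acted_in", "The Artist"),
    ("The Artist", "directed_by", "Michel Hazanavicius"),
    ("Jean Dujardin", "friend_of", "Gilles Lellouche"),
]

# Index the graph by node so we can list every typed link leaving a concept.
outgoing = defaultdict(list)
for subject, relation, obj in triples:
    outgoing[subject].append((relation, obj))

print(outgoing["Jean Dujardin"])
# [('acted_in', 'The Artist'), ('friend_of', 'Gilles Lellouche')]
```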

If we wish to use these bases fully and efficiently, we must be able to identify and correct errors. However, according to Antoine BORDES, "the bases have now reached a size such that the search for errors can no longer be carried out by human agents. We have to come up with a system to handle them: a software package designed to extract data that does not fit the regularities underpinning the base, as identified by the software itself. In most instances, the analysts are then faced with false data that requires further investigation."
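In code, the flagging step described here might look like the sketch below: every stored fact is scored by some model of the base's regularities, and the least plausible ones are handed to a human analyst. The `plausibility` function and the threshold value are placeholders, not the project's actual implementation.

```python
# Sketch of the flagging step: score every stored fact with a model of the
# base's regularities and return the least plausible ones for human review.
# `plausibility` stands for any such model; the threshold is an example value.
def flag_suspect_facts(triples, plausibility, threshold=0.1):
    """Return the facts whose estimated plausibility falls below the threshold."""
    return [t for t in triples if plausibility(*t) < threshold]
```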

The ANR project supervised and managed by Antoine BORDES began in January 2013 for a duration of four years. Its objective is to "make the bases more readable and simpler through summarisation techniques". To do this, the UTC Heudiasyc Laboratory will project the bases studied into a vector space in which the links among nodes are modelled using probability functions. The probability values will enable the research scientists to establish distances between nodes and therefore identify forms of similarity between them. The aim is to group the millions of nodes into clusters that, between them, retain more or less all the data.
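One common way to realise this idea, in the spirit of the translation-based embedding models associated with Antoine Bordes's research (though not necessarily the project's exact formulation), is sketched below: each node and each relation type becomes a vector, a candidate link is scored by how close "head + relation" lands to "tail", and a sigmoid turns the score into a rough probability. The vectors here are random stand-ins for learned embeddings.

```python
# Sketch of the general idea, not the project's exact model: nodes and relation
# types are vectors, and a link (head, relation, tail) is scored by how close
# head + relation lands to tail.  Random vectors stand in for learned ones.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
entity_vec = {name: rng.normal(size=dim) for name in ["France", "Germany", "Europe"]}
relation_vec = {"part_of": rng.normal(size=dim)}

def link_score(head, relation, tail):
    """Higher (less negative) means the link is more plausible."""
    return -np.linalg.norm(entity_vec[head] + relation_vec[relation] - entity_vec[tail])

def link_probability(head, relation, tail):
    """Squash the score into a rough probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-link_score(head, relation, tail)))

print(link_probability("France", "part_of", "Europe"))
```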

Modelling the database in this way reveals its regular features, i.e., groups of entities that express similarities, or links that express similar things. The vector space can then be projected onto a 2D surface to help visualisation and analysis. Thus, explains Antoine BORDES, "by applying these calculations to the WordNet database, where each node represents a group of word meanings (the word 'sleeve', which can carry several meanings, is represented by as many nodes as there are meanings) and each link a lexical relationship among those meanings (thus, a sleeve is part of a pullover), we can start with a word and determine the other words or synonyms that are "closest" in meaning or connotation".
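The "closest word" queries described in the quote then reduce to nearest-neighbour search in the embedding space, as in the toy sketch below; the vectors are random placeholders, whereas in practice they would come from the trained model.

```python
# Sketch: once every node has a vector, "closest in meaning" is simply a
# nearest-neighbour query over the embeddings (random placeholders here).
import numpy as np

rng = np.random.default_rng(1)
words = ["sleeve", "pullover", "cuff", "engine", "car"]
embedding = {w: rng.normal(size=20) for w in words}

def closest(word, k=3):
    """Return the k words whose vectors lie nearest to the query word's vector."""
    def distance(other):
        return np.linalg.norm(embedding[word] - embedding[other])
    return sorted((w for w in words if w != word), key=distance)[:k]

print(closest("sleeve"))
```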

The algorithm used groups together, for example, several European countries, but it can also be used to "see" countries that are close to Europe without being Member States. The algorithm can then suggest missing links or relationships in the knowledge base, using a probabilistic method. It could therefore prove extremely useful for suggesting new connections in social networks, for example. Likewise, it could serve in genetic engineering, using protein and gene bases to suggest possible interactions between a gene and a given protein, even if, as Antoine BORDES points out, "this will never replace conventional genetic research, but could suggest new areas for research scientists to investigate".
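A hedged sketch of the missing-link suggestion itself: given any plausibility scorer (such as the toy one above), rank candidate tail entities for a (head, relation) pair and keep the highest-scoring facts that are not already in the base.

```python
# Sketch of missing-link suggestion.  `score(head, relation, tail)` is any
# plausibility scorer, e.g. the toy embedding scorer sketched earlier;
# higher means more plausible.  `known_triples` stands for the existing base.
def suggest_links(score, head, relation, candidates, known_triples, top_k=5):
    """Rank candidate tails and return the top_k new (tail, score) suggestions."""
    new = [(tail, score(head, relation, tail)) for tail in candidates
           if (head, relation, tail) not in known_triples]
    return sorted(new, key=lambda pair: pair[1], reverse=True)[:top_k]
```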

A longer-range objective of the project is to merge several complementary knowledge bases, thereby avoiding duplicate entries, which stem from different encoding procedures. For example, merging two cinema bases, one using the link "actor So-and-so played in this film" and the other "this film's actors are", will inevitably lead to duplicate entries. Our algorithm will detect these and delete them as appropriate. Once the bases are correctly merged and 'cleaned up' (duplicates removed, etc.), they can provide far more information, and of much better quality.
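As a rough illustration of this de-duplication step (the relation names are invented, and a real system would rely on the learned model rather than a hand-written table), mapping known inverse relations onto one canonical direction already lets exact duplicates be dropped when two bases are merged:

```python
# Sketch: if one base says (film, "has_actor", person) and the other says
# (person, "acted_in", film), both encode the same fact.  Rewriting known
# inverse relations to a canonical direction makes the duplicates identical,
# so a set union removes them.  Relation names are illustrative examples.
INVERSE_OF = {"has_actor": "acted_in"}  # canonical form: (person, acted_in, film)

def canonical(triple):
    head, relation, tail = triple
    if relation in INVERSE_OF:
        return (tail, INVERSE_OF[relation], head)
    return triple

def merge(base_a, base_b):
    return {canonical(t) for t in base_a} | {canonical(t) for t in base_b}

merged = merge(
    {("The Artist", "has_actor", "Jean Dujardin")},
    {("Jean Dujardin", "acted_in", "The Artist")},
)
print(merged)  # a single canonical triple remains
```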