We have posted previously about Pingar’s research project on creating taxonomies on the fly and the challenges such a system would face. The project has come a long way from being just a prototype, and as of this month our Taxonomy Generator is available to selected customers for beta testing. This post reveals its inner workings.
Enterprise taxonomies are the most effective tools for organizing documents. They ensure consistency when describing a document’s content and drive high-quality metadata, which allows us to find that document in many different places, depending on our perspective at the time of searching. Folders, the traditional way of organizing documents, require each document to appear in a single place; they are difficult to maintain and quickly get messy. Information professionals consider folders to be the new “F” word! So why do most people and organizations still use folders?
The answer is cost. Creating useful taxonomies takes a lot of time and effort. There are many ready-to-use taxonomies covering various verticals, but most organizations can rarely use them out of the box, because they are either too narrow in their coverage of some topics or too deep in their coverage of others. Large resources like DBpedia and Freebase cover millions of topics, but they are too large to be useful as a single taxonomy for a given organization. Even they will miss the very specific topics that matter inside an organization, such as project names, employees, office locations, or specialist terminology.
We argue that a useful taxonomy is one that contains terms relevant to the documents it is meant to organize. These terms can be sourced from existing taxonomies and Wikipedia, and by using entity and terminology extraction algorithms. After that, it is a matter of grouping these terms into a meaningful hierarchy.
The image below explains how the Pingar Taxonomy Generator works. As input, it receives documents in various formats, which may be stored on a file share, in a document management system such as SharePoint, or on an Exchange mail server. These documents are then processed and analyzed using a variety of tools and datasets in order to extract taxonomy terms and the relations between them. The output is a taxonomy that combines these terms and relations into a single hierarchical structure useful for document organization.
A closer look at the automated creation of a taxonomy reveals five clearly defined steps:
In the first step, we import documents into Apache Solr, using the Apache Tika library to extract the content of these documents.
In the second step, we apply various entity extraction techniques to identify concepts of interest. This includes terms from existing taxonomies; the customer can decide which taxonomies are likely to contain relevant terms. At this stage, we also identify names of people, organizations and locations of interest. The Pingar API allows us to perform both of these tasks at once. We also extract relevant Wikipedia articles and specific terminology of interest.
In step three, we connect the identified entities with their corresponding IDs in Linked Data sources such as Freebase. This lets us draw on more knowledge about these entities in the following steps.
In step four, we look at concepts from different sources that were linked to the same phrases in the documents and decide whether they mean the same thing and can be merged into a single taxonomy term. This process is called disambiguation.
Finally, in step five, we consolidate all identified concepts and the relations between them into the final taxonomy. This process gradually connects more and more taxonomy terms into a single network, occasionally adding new terms required to connect existing ones. We also use pruning techniques to ensure that no redundant information is included in the final taxonomy.
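To make step five more concrete, here is a minimal Python sketch of the consolidation idea. It is an illustration only, not the Taxonomy Generator's actual algorithm: the function name build_taxonomy, the toy broader lookup (standing in for relations drawn from external sources) and the pruning rule (drop an added term that has only one child) are all assumptions made for this example.

```python
def build_taxonomy(doc_terms, broader):
    """Connect document terms into a hierarchy (child -> parent mapping).

    doc_terms: terms extracted from the documents (hypothetical input).
    broader:   toy lookup of term -> broader term, standing in for the
               relations drawn from external sources.
    """
    parent_of = {}

    # Connect: walk up from each document term, adding the intermediate
    # terms needed to link it to the rest of the network.
    for term in doc_terms:
        current = term
        while current in broader and current not in parent_of:
            parent_of[current] = broader[current]
            current = broader[current]

    # Prune (assumed rule): an added term that is not in the documents
    # and has exactly one child carries no grouping information, so its
    # child is attached directly to its parent. Top-level terms are kept.
    changed = True
    while changed:
        changed = False
        for node in list(parent_of):
            children = [c for c, p in parent_of.items() if p == node]
            if node not in doc_terms and len(children) == 1:
                parent_of[children[0]] = parent_of.pop(node)
                changed = True
                break
    return parent_of


# Invented example data: three terms found in documents and a small
# chain of broader terms for each.
doc_terms = {"Auckland", "Wellington", "Python"}
broader = {
    "Auckland": "Cities in New Zealand",
    "Wellington": "Cities in New Zealand",
    "Cities in New Zealand": "New Zealand",
    "New Zealand": "Places",
    "Python": "Programming languages",
    "Programming languages": "Software",
    "Software": "Technology",
}

taxonomy = build_taxonomy(doc_terms, broader)
for child, parent in sorted(taxonomy.items()):
    print(f"{child} -> {parent}")
```

In this toy run, "Cities in New Zealand" is added because it connects two document terms, while single-child links such as "New Zealand" and "Programming languages" are pruned away.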
The resulting taxonomy can be plugged straight into the Pingar API for automated metadata extraction from documents created in the future. To update the taxonomy, one simply reruns the Taxonomy Generator with a new or extended document collection.
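As a rough illustration of what taxonomy-driven metadata extraction means, the following Python sketch tags a new document with the taxonomy terms it mentions, together with each term's path from the top of the hierarchy. It does not call the Pingar API; the function tag_document, the child-to-parent dictionary format and the simple phrase matching are assumptions made for this example.

```python
import re


def tag_document(text, parent_of):
    """Return {matched term: path from top-level term down to the term}.

    parent_of is a hypothetical child -> parent mapping describing a
    tree-shaped taxonomy; matching is plain case-insensitive phrase
    search, far simpler than real entity extraction.
    """
    all_terms = set(parent_of) | set(parent_of.values())
    tags = {}
    for term in all_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            # Climb to the top of the hierarchy to build the full path.
            path = [term]
            while path[-1] in parent_of:
                path.append(parent_of[path[-1]])
            tags[term] = path[::-1]
    return tags


# Invented taxonomy and document for illustration.
taxonomy = {
    "Auckland": "Cities in New Zealand",
    "Wellington": "Cities in New Zealand",
    "Cities in New Zealand": "Locations",
}
tags = tag_document("Our Auckland office moved last year.", taxonomy)
print(tags)
```

Because each tag carries its full path, the same document can be found under "Auckland" as well as under the broader "Cities in New Zealand" and "Locations" terms.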
As we like to say about Pingar’s other text analytics capabilities: what the Taxonomy Generator delivers is simple, a custom taxonomy on the fly; how we do it is complex.
Dr Alyona Medelyan
Pingar Chief Research Officer