“Extracting and Mapping SharePoint Content to Create a Custom Taxonomy”
In a couple of weeks I will be heading to Philadelphia to present the latest Pingar research project, the Taxonomy Generator, at ShareFEST 2013, the premier SharePoint community conference for life sciences.
The Pingar Taxonomy Generator creates taxonomies on the fly from any set of documents. The content of the taxonomy is focused on the particular document collection.
Now, can you think of a platform that houses content of various natures and in many different formats? SharePoint!
Can you think of a system that uses taxonomies to effectively classify document content via metadata assignment? SharePoint!
And can you think of a domain that can benefit from accurate and consistent metadata assignment for records management and also realize benefit from knowledge discovery? Life Sciences!
Let’s take a closer look!
Pingar Taxonomy Generator – Overview
Our Taxonomy Generator is truly novel technology. (Our research paper has been accepted at ESWC 2013, a major venue for the latest innovations in semantic technologies.)
It works in 5 steps:
1. Collect documents and convert to text
Given a source where documents are stored, such as SharePoint, we first crawl it to collect all relevant documents and extract the textual content from each of them.
2. Extract concepts
We apply various entity extraction techniques to identify concepts of interest. We use the Pingar API to detect relevant terms from existing taxonomies and we also identify names of people, organizations and locations of interest. Custom entities can be defined to match your business requirements. We also extract relevant Wikipedia articles and specific terminology of interest.
3. Annotate with Linked Data
In this step, we connect the identified entities with their corresponding IDs in Linked Data sources such as Freebase. This lets us draw on additional knowledge about these entities in the following steps.
4. Disambiguate clashing concepts
We look at concepts from different sources that were linked to the same phrases in the document and decide whether they mean the same thing and can be merged into a single taxonomy term. This process is called disambiguation.
5. Consolidate the taxonomy
Finally, we consolidate all identified concepts and the relations between them into the final taxonomy. This process gradually connects more and more taxonomy terms into a single network, occasionally adding new terms where they are needed to connect existing ones. We also use pruning techniques to ensure that no redundant information is included in the final taxonomy.
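To make the consolidation step a little more tangible, here is a toy sketch in Python. It is not the Pingar algorithm: it simply assumes that each concept found in the documents comes with a hypothetical "broader term" path (as one might obtain from a Linked Data source), merges those paths into one tree, and then prunes connector terms that have a single child and do not occur in the documents. The function names, the paths and the pruning rule are all illustrative assumptions.

```python
from collections import defaultdict


def consolidate(paths):
    """Merge root-to-concept paths into one parent -> children mapping."""
    children = defaultdict(set)
    for path in paths:
        for parent, child in zip(path, path[1:]):
            children[parent].add(child)
    return children


def prune(children, root, keep):
    """Skip connector terms that have exactly one child and are not in `keep`."""
    pruned = {}

    def visit(node):
        kids = []
        for child in sorted(children.get(node, ())):
            # Walk past single-child connectors that no document mentions.
            while child not in keep and len(children.get(child, ())) == 1:
                child = next(iter(children[child]))
            kids.append(child)
            visit(child)
        pruned[node] = kids

    visit(root)
    return pruned


# Hypothetical broader-term paths for three concepts found in the documents.
paths = [
    ["Thing", "Disease", "Metabolic disease", "Diabetes"],
    ["Thing", "Disease", "Infectious disease", "Viral disease", "Influenza"],
    ["Thing", "Disease", "Infectious disease", "Viral disease", "Measles"],
]
found_in_documents = {"Diabetes", "Influenza", "Measles"}

tree = prune(consolidate(paths), "Thing", found_in_documents)
for parent, kids in tree.items():
    print(parent, "->", kids)
```

In this toy run, "Disease" and "Viral disease" are kept even though no document mentions them, because they connect several terms, while "Metabolic disease" and "Infectious disease" are pruned as redundant single-child links.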
Pingar Taxonomy Generator – Benefits
Taxonomies are the most effective tools for organizing documents. They ensure consistency when describing a document’s content, which in turn enables document management, storing and retrieval.
Traditionally, taxonomies have been created manually by domain experts. This means they require time and effort, which can make them very costly!
There are a few publicly or commercially available taxonomies, but these are often too broad, too narrow or not specific enough for a particular organization, department or project. Large resources like DBpedia and Freebase cover millions of topics, but they are too large to be useful as a single taxonomy for a given organization. Even they miss very specific topics such as project names, employees, office locations or specialist terminology.
Our method of generating taxonomies utilizes a number of different resources automatically (therefore consistently and inexpensively) to produce a taxonomy that contains terms relevant to the documents it is asked to organize.
Pingar Taxonomy Generator & SharePoint
As we know, SharePoint houses content of various natures and in many different formats. It also uses taxonomies to classify document content effectively via metadata assignment.
The image below illustrates how the Pingar Taxonomy Generator can be integrated with SharePoint. It receives as input various documents stored in SharePoint and analyzes them using a variety of tools and datasets in order to extract taxonomy terms and the relations between them. Domain-specific taxonomies stored in SharePoint can also be included. The output is a taxonomy that combines these terms and relations into a single hierarchical structure useful for document organization. The taxonomy can be returned in SharePoint's CSV format, ready to be uploaded to a SharePoint Term Set.
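As a rough illustration of that last step, the Python sketch below writes a small taxonomy in the column layout used by SharePoint's managed metadata import file, as we understand it: one row per term, with the full path spread over up to seven "Level N Term" columns. The term set name, the sample tree and the function names are made up for this example, and the column layout should be checked against the sample import file offered by your own Term Store before relying on it.

```python
import csv
import io

# Column layout of the term set import file (an assumption to verify
# against the sample file from your SharePoint Term Store).
HEADER = ["Term Set Name", "Term Set Description", "LCID",
          "Available for Tagging", "Term Description"]
HEADER += [f"Level {i} Term" for i in range(1, 8)]


def term_paths(tree, path=()):
    """Yield the full path of every term in a nested dict, parents first."""
    for term, kids in tree.items():
        current = path + (term,)
        yield current
        yield from term_paths(kids, current)


def to_import_csv(name, description, tree):
    """Serialize a nested-dict taxonomy as term set import CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for i, path in enumerate(term_paths(tree)):
        if len(path) > 7:
            raise ValueError("the import format allows at most 7 levels")
        levels = list(path) + [""] * (7 - len(path))
        # Only the first data row carries the term set name and description.
        set_columns = [name, description] if i == 0 else ["", ""]
        writer.writerow(set_columns + ["", "TRUE", ""] + levels)
    return out.getvalue()


# Hypothetical taxonomy, for illustration only.
taxonomy = {"Disease": {"Viral disease": {"Influenza": {}, "Measles": {}},
                        "Diabetes": {}}}
csv_text = to_import_csv("Biomedical demo", "Generated taxonomy", taxonomy)
print(csv_text)
```

The LCID and term description columns are left empty here, which we assume leaves the Term Store to apply its defaults.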

Pingar Taxonomy Generator, SharePoint & Life Sciences
Linked Data and the Semantic Web are mature ideas in the area of bioinformatics, and plenty of resources exist (MeSH, OBO Foundry and so on). The Pingar Taxonomy Generator can use these biomedical resources while analyzing documents for concepts and relations. This allows the owner of a particular biomedical document collection to gain a more focused view of the content of the documents. Consider a researcher who is tasked with analyzing patient records, clinical trials and pertinent scientific articles: a custom taxonomy will provide him or her with a specialized knowledge representation, enabling a deeper understanding of the content and a better ability to formulate further hypotheses.
Please join us at our session at the ShareFEST 2013 conference to learn more! We plan to tell you more about the system and show you a few biomedical taxonomies that we generated automatically.
Anna Divoli, PhD
Pingar Research
We have blogged several times in the past on projects that involve User Interface (UI) elements and users’ preferences. Lately we have been working a lot with taxonomies and we want an easy-to-use tool to visualize, browse and edit taxonomies. Since we couldn’t find anything simple and affordable, we decided to build our own taxonomy editor. But before you start developing, you do research! There are some published studies that advise on the presentation and design of taxonomy and facet elements, but not on all the UI elements we were pondering. Additionally, we wanted to conduct our own survey to make sure that (a) the user preferences are current and (b) we collect responses from a group of people who are likely to use our taxonomy editor.
I presented some preliminary results from this survey at TAW 2012 in Boston (you can see the slides from this talk in a previous post “What a week in Boston! TAW & HCIR”). In this post I would like to share with you our final results including some interesting comments from the participants.
The 59 Participants
We emailed people we know to recruit participants and asked them to forward the invitation to their friends and colleagues too. We also tweeted the link to the survey (we estimate that we got about 8 participants through Twitter). 65 people started the survey but only 59 completed it. Let us introduce you to those 59:

They were based in 10 different countries (Canada, France, Greece, HK, India, NZ, Spain, Switzerland, UK, and USA). We had 18 females and 41 males from various backgrounds: IT (developers/programmers, testers, pre-sales engineers, web-designers, project managers, directors), knowledge management, terminology, and text analytics/NLP experts; as well as consultants, sales people, linguists, graphic designers, media/advertising, HR, lawyers, engineers, accountants, and researchers in the areas of physics, genetics, and bioinformatics.
Below are their preferences on displaying and searching hierarchical data!
We hope you find the data useful! Remember that we investigated each feature in isolation, so it is very important to pay close attention when putting them together to assemble a complete system.
Anna Divoli
Senior Software Researcher
I. Presentation: sorting, counts, expansions and labels
1. We asked: When you are navigating through a taxonomy, would you like the taxonomy nodes listed by popularity or alphabetically?

Results:

Some interesting comments:
Participants that selected: alphabetically (B)
Participants that selected: popularity (A):
Participants that selected: no preference:
2. We asked: When you are looking at the counts of an item, would you like to see the count of the direct children (A) or of everything underneath (B)?
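To make the two counting options concrete, here is a small Python sketch. The car-themed taxonomy and the function names are hypothetical and only illustrate the difference between counting direct children (A) and counting everything underneath (B).

```python
# Hypothetical taxonomy: each node maps to the list of its direct children.
TAXONOMY = {
    "Vehicles": ["Cars", "Trucks"],
    "Cars": ["Chevrolet", "Ford"],
    "Chevrolet": ["Camaro", "Corvette"],
}


def direct_count(tree, node):
    """Option A: number of direct children only."""
    return len(tree.get(node, []))


def total_count(tree, node):
    """Option B: number of all nodes underneath, at any depth."""
    return sum(1 + total_count(tree, child) for child in tree.get(node, []))


for node in ("Vehicles", "Cars", "Chevrolet"):
    print(node, direct_count(TAXONOMY, node), total_count(TAXONOMY, node))
```

For "Vehicles", option A shows 2 while option B shows 6, so the same node can look small or large depending on which count the interface displays.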

Results:

Some interesting comments:
Participants that selected: direct children (A):
Participants that selected: popularity (A):
Participants that selected: everything underneath (B):
3. We asked: Do you prefer the nodes of a taxonomy being displayed in frames (A) or with labels (B)?

Results:

Some interesting comments:
Participants that selected: in frames (A):
Participants that selected: with labels (B):
Participants that selected: no preference:
4. We also showed the participants different types of labels!
Results:

There was no clear preference among the types of labels. The comments indicated that the icons were distracting and that labels should be avoided unless there is an association between the label and the word/concept.
Some interesting comments:
Participants that selected: A (round):
Participants that selected: no preference:
5. We asked: Do you prefer to expand/collapse taxonomy nodes using plus (+) / minus (-) signs (A) or arrows (B)?

Results:

Some interesting comments:
Participants that selected: plus (+) / minus (-) signs (A):
Participants that selected: arrows (B):
Participants that selected: no preference:
II. Search: displaying and highlighting matches, functionality… and that search box!
6. We asked: Assume you searched for "Chevrolet" and multiple taxonomy nodes matched your query. Would you like the system to return your matched query and just the parent nodes (A), just the query and the children nodes (B), or the query with both parent and children nodes (C)? [This will help you select the node that interests you the most.]
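The three options can be sketched in a few lines of Python. The taxonomy below is hypothetical, with "Chevrolet" appearing in two places, and the decision to show the full path of ancestors rather than only the immediate parent is our assumption for this illustration.

```python
# Hypothetical taxonomy: node id -> label, and node id -> parent id.
LABELS = {1: "Vehicles", 2: "Cars", 3: "Chevrolet", 4: "Camaro",
          5: "Companies", 6: "Chevrolet", 7: "General Motors"}
PARENT = {2: 1, 3: 2, 4: 3, 7: 5, 6: 7}


def ancestors(node):
    """Labels on the path from the root down to the node's parent."""
    path = []
    while node in PARENT:
        node = PARENT[node]
        path.append(LABELS[node])
    return path[::-1]


def children(node):
    """Labels of the node's direct children."""
    return [LABELS[c] for c, p in sorted(PARENT.items()) if p == node]


def search(query, mode="C"):
    """Return matches with parents (A), children (B) or both (C)."""
    hits = []
    for node, label in sorted(LABELS.items()):
        if label.lower() != query.lower():
            continue
        hit = {"match": label}
        if mode in ("A", "C"):
            hit["parents"] = ancestors(node)
        if mode in ("B", "C"):
            hit["children"] = children(node)
        hits.append(hit)
    return hits


for hit in search("Chevrolet", mode="C"):
    print(hit)
```

With option (C), the two "Chevrolet" nodes can be told apart by their parents ("Vehicles > Cars" versus "Companies > General Motors") and by whether they have children.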

Results:

Some interesting comments (from participants that selected (C)):
7. We asked: When you search a taxonomy for a term, do you want to return just exact matches or you are interested in partial matches and hidden matches too? Please SELECT ALL that apply!

Results:

Some interesting comments:
8. We asked: Would you like exact, partial and hidden matches to be highlighted in a different manner in the search output?

Results:

Some interesting comments:
Participants that selected: all the same (A):
Participants that selected: exact vs. the rest (B):
Participants that selected: no preference:
9. Many have asked this… We dared to ask too: Which search box do you prefer: A, B, C or D? Do you like the use of shadow in the search box (1) or prefer it without shadow (2)?

Results:

Some interesting comments: