Pingar Research Blog

Pingar at ShareFEST 2013 - 9 April, 2013

“Extracting and Mapping SharePoint Content to Create a Custom Taxonomy”

In a couple of weeks I will be heading to Philadelphia to present the latest Pingar research project, the Taxonomy Generator, at ShareFEST 2013, the premier SharePoint community conference for life sciences.

The Pingar Taxonomy Generator creates taxonomies on the fly from any set of documents. The content of the taxonomy is focused on the particular document collection.

Now, can you think of a platform that houses content of various natures and in many different formats? SharePoint!

Can you think of a system that uses taxonomies to effectively classify document content via metadata assignment? SharePoint!

And can you think of a domain that can benefit from accurate and consistent metadata assignment for records management and also realize benefit from knowledge discovery? Life Sciences!

Let’s take a closer look!


Pingar Taxonomy Generator – Overview

Our Taxonomy Generator is truly novel technology. (Our research paper has been accepted at ESWC 2013, a major venue for the latest innovations in semantic technologies.)

It works in 5 steps:

1. Collect documents and convert to text
Given a particular source, where documents are stored, for example, SharePoint, we first crawl this source to collect all relevant documents and extract the textual content from each of them.

2. Extract concepts
We apply various entity extraction techniques to identify concepts of interest. We use the Pingar API to detect relevant terms from existing taxonomies and we also identify names of people, organizations and locations of interest. Custom entities can be defined to match your business requirements. We also extract relevant Wikipedia articles and specific terminology of interest.

3. Annotate with Linked Data
In this step, we connect identified entities with their corresponding IDs in Linked Data sources such as Freebase. This allows us to draw more knowledge about these entities to use it in the following steps.

4. Disambiguate clashing concepts
We look at concepts from different sources that were linked to the same phrases in the document and decide whether they mean the same thing and can be merged into a single taxonomy term. This process is called disambiguation.

5. Consolidate the taxonomy
Finally, we consolidate all identified concepts and the relations between them into the final taxonomy. This process gradually connects more and more taxonomy terms into a single network, occasionally adding new terms that are needed to connect existing ones. We also use pruning techniques to ensure that no redundant information is included in the final taxonomy.
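
To make step 4 more concrete, here is a minimal Python sketch of the general idea. It is illustrative only: the `Concept` class, the source names and the merge rule (treating two concepts as the same when their normalized labels match) are assumptions made for this example, and they do not describe how the Pingar Taxonomy Generator actually makes the decision.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    phrase: str  # text span in the document that triggered the concept
    source: str  # where the concept came from (hypothetical source names)
    label: str   # the concept's preferred label in that source


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(text.lower().split())


def merge_clashing(concepts):
    """Group concepts by phrase, then merge those whose normalized labels agree.

    Returns a dict mapping each phrase to a list of merged terms. Each term
    records the shared label and the sources that contributed it.
    The label comparison is a stand-in for a real disambiguation decision.
    """
    by_phrase = defaultdict(list)
    for concept in concepts:
        by_phrase[normalize(concept.phrase)].append(concept)

    merged = {}
    for phrase, group in by_phrase.items():
        sources_by_label = defaultdict(set)
        for concept in group:
            sources_by_label[normalize(concept.label)].add(concept.source)
        merged[phrase] = [
            {"label": label, "sources": sorted(sources)}
            for label, sources in sorted(sources_by_label.items())
        ]
    return merged


# Toy input: three candidate concepts attached to the same phrase.
candidates = [
    Concept("apple", "wikipedia", "Apple Inc."),
    Concept("apple", "freebase", "Apple Inc."),
    Concept("apple", "wikipedia", "Apple (fruit)"),
]
result = merge_clashing(candidates)
for term in result["apple"]:
    print(term["label"], term["sources"])
```

In this toy input the two "Apple Inc." candidates collapse into one term backed by two sources, while "Apple (fruit)" stays separate.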


Pingar Taxonomy Generator – Benefits

Taxonomies are the most effective tools for organizing documents. They ensure consistency when describing a document’s content, which in turn enables document management, storing and retrieval.

Traditionally, taxonomies have been created manually by domain experts. This means they require time and effort, which can make them very costly!

There are a few publicly or commercially available taxonomies, but often these are too broad, too narrow or not specific enough for a particular organization, a department or a specific project. Large resources like DBpedia and Freebase cover millions of topics, but would be too large to be useful as a single taxonomy for a given organization. And even they will miss all those very specific topics, such as project names, employees, office locations, or specialist terminology.

Our method of generating taxonomies utilizes a number of different resources automatically (therefore consistently and inexpensively) to produce a taxonomy that contains terms relevant to the documents it is asked to organize.


Pingar Taxonomy Generator & SharePoint

As we know, SharePoint houses content of various natures and in many different formats. It also uses taxonomies to effectively classify document content via metadata assignment.

The image below illustrates how the Pingar Taxonomy Generator can be integrated with SharePoint. It receives as input various documents stored in SharePoint. It then analyzes them using a variety of tools and datasets, in order to extract taxonomy terms and the relations between them. Domain-specific taxonomies stored in SharePoint can also be included in this analysis. The output is a taxonomy, which combines these terms and relations into a single hierarchical structure useful for document organization. The taxonomy can be returned in SharePoint format (CSV), ready to be uploaded to a SharePoint Term Set.
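
As a rough illustration of what such a CSV export could look like, the Python sketch below flattens a nested taxonomy into rows. It assumes the column layout of SharePoint's managed metadata import file (a term set name on the first data row, followed by up to seven "Level N Term" columns); please verify that layout against Microsoft's documentation before relying on it. The function names, the sample terms and the term set name are hypothetical and are not part of the Pingar Taxonomy Generator.

```python
import csv
import io

# Assumed column layout of a SharePoint managed metadata import file.
MAX_LEVELS = 7
HEADER = [
    "Term Set Name", "Term Set Description", "LCID",
    "Available for Tagging", "Term Description",
] + [f"Level {i} Term" for i in range(1, MAX_LEVELS + 1)]


def term_paths(tree, path=()):
    """Yield the full path of every term, each parent before its children."""
    for label in sorted(tree):
        current = path + (label,)
        if len(current) > MAX_LEVELS:
            raise ValueError(f"term {current!r} is deeper than {MAX_LEVELS} levels")
        yield current
        yield from term_paths(tree[label], current)


def to_term_set_csv(term_set_name, tree):
    """Serialize a nested dict taxonomy as CSV text, one row per term."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(HEADER)
    for index, path in enumerate(term_paths(tree)):
        # Only the first data row carries the term set name.
        name = term_set_name if index == 0 else ""
        levels = list(path) + [""] * (MAX_LEVELS - len(path))
        writer.writerow([name, "", "", "TRUE", ""] + levels)
    return buffer.getvalue()


# Hypothetical sample taxonomy: each key is a term, each value its children.
sample = {"Vehicles": {"Cars": {"Chevrolet": {}}, "Bikes": {}}}
print(to_term_set_csv("Demo Term Set", sample))
```

Each row repeats the full path from the top-level term down to the term being defined, which is why a child row such as "Vehicles", "Cars" also lists its parent.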

[Image: integration of the Pingar Taxonomy Generator with SharePoint]


Pingar Taxonomy Generator, SharePoint & Life Sciences

Linked Data and the Semantic Web are mature ideas in the area of bioinformatics and plenty of resources exist (MeSH, OBO Foundry and so on). The Pingar Taxonomy Generator can utilize these biomedical resources while analyzing documents for concepts and relations. This allows the owner of a particular biomedical document collection to gain a more focused view of the content of the documents. If we consider the case of a researcher who is tasked with analyzing patient records, clinical trials and pertinent scientific articles, a custom taxonomy will provide him or her with a specialized knowledge representation, enabling deeper understanding of the content and a better ability to formulate further hypotheses.

Please join us at our session at the ShareFEST 2013 conference to learn more! We plan to tell you more about the system and show you a few biomedical taxonomies that we generated automatically.


Anna Divoli, PhD
Pingar Research


UI preferences for viewing hierarchical data - 19 February, 2013

We have blogged several times in the past on projects that involve User Interface (UI) elements and users’ preferences. Lately we have been working a lot with taxonomies and we want an easy-to-use tool to visualize, browse and edit taxonomies. Since we couldn’t find anything simple and affordable, we decided to build our own taxonomy editor. But before you start developing, you do research! There are some published studies that advise on the presentation and design of taxonomy and facet elements, but not on all of the UI elements we were pondering. Additionally, we wanted to conduct our own survey to make sure that (a) the user preferences are current and (b) we collect responses from a group of people that are likely to use our taxonomy editor.

I presented some preliminary results from this survey at TAW 2012 in Boston (you can see the slides from this talk in a previous post “What a week in Boston! TAW & HCIR”). In this post I would like to share with you our final results including some interesting comments from the participants.

The 59 Participants

We recruited participants by emailing people we know and asking them to forward the invitation to their friends and colleagues. We also tweeted the link to the survey (we estimate that we got about 8 participants through Twitter). 65 people started the survey and 59 completed it. Let us introduce you to those 59:

They were based in 10 different countries (Canada, France, Greece, Hong Kong, India, New Zealand, Spain, Switzerland, the UK, and the USA). We had 18 female and 41 male participants from various backgrounds: IT (developers/programmers, testers, pre-sales engineers, web designers, project managers, directors), knowledge management, terminology, and text analytics/NLP experts; as well as consultants, sales people, linguists, graphic designers, media/advertising, HR, lawyers, engineers, accountants, and researchers in the areas of physics, genetics, and bioinformatics.

Below are their preferences on displaying and searching hierarchical data!

We hope you find the data useful! Remember that we investigated each feature in isolation, so it is very important to pay close attention to how the features work together when assembling a complete system.

Anna Divoli
Senior Software Researcher

 

I. Presentation:  sorting, counts, expansions and labels

1. We asked: When you are navigating through a taxonomy, would you like the taxonomy nodes listed by popularity (A) or alphabetically (B)?

 

Results:

 Some interesting comments:

Participants that selected: alphabetically (B)

  • If the alphabetical nodes also contain the popularity information, then ordering alphabetically loses no information and makes it easier to find a node of interest.
  • It depends what the list is. IF it is something where I would likely understand the popularity and hence order then that is easier. If not, alphabetically is best.
  • Easier to find what I'm looking for.

 

Participants that selected: popularity (A):

  • It really depends on the use case, am I looking for an expected value, or attempting to classify something. Two very different tasks.
  • Alphabetically associated with a much more systematic way of sorting a list of items or products where there's a singularity whereas popularity tends to varies.
  • Should let user select the sorting criteria if possible
  • Because I get a feel that I am seeing some output.

 

Participants that selected: no preference:

  • Depends if am I looking for a particular "car" or not, or what my goal is (in terms of the task i am trying to accomplish

 

2. We asked:  When you are looking at the counts of an item, would you like to see the count of the direct children (A) or of everything underneath (B)?

Results:

 Some interesting comments:

Participants that selected: direct children (A):

  • Because it’s explanatory.
  • Appears curious when you see (5) by Cars in B yet you only see three cars

 

Participants that selected: everything underneath (B):

  • You have to show everything, or you don't actually have a taxonomy do you?
  • Depends on what I'm browsing for. Can't I have both? (5/27) on the example on the left?
  • Depends on the size of the list.
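
One participant asked whether both counts could be shown together. A small Python sketch of that idea follows, assuming the taxonomy is held as a nested dictionary; the data structure, the function names and the sample nodes are made up for illustration and are not taken from our taxonomy editor.

```python
def count_direct(node):
    """Number of immediate children of a node (option A)."""
    return len(node)


def count_all(node):
    """Number of nodes anywhere underneath a node (option B)."""
    return sum(1 + count_all(child) for child in node.values())


def label_with_counts(name, node):
    """Render a label that shows both counts, e.g. 'Cars (3/5)'."""
    return f"{name} ({count_direct(node)}/{count_all(node)})"


# Hypothetical sample: each key is a node, each value its children.
cars = {
    "Chevrolet": {"Impala LS": {}, "Malibu": {}},
    "Ford": {},
    "Toyota": {},
}
print(label_with_counts("Cars", cars))
```

With this sample, "Cars" has three direct children but five nodes underneath it, which is exactly the situation that one participant found curious when only the larger number was shown.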

 

3. We asked: Do you prefer the nodes of a taxonomy being displayed in frames (A) or with labels (B)?

Results:

 Some interesting comments:

Participants that selected: in frames (A):

  • In this particular case I choose (A) However, I wouldn't mind (B) but the graphic icons that have been used are ugly and overpower the typography. If refined and it was related to a brands look and feel then this could work.
  • B is confusing.

 

Participants that selected: with labels (B):

  • I would propose a simple dot will do, rather than having animated labels or what have you.
  • Easier to read

 

Participants that selected: no preference:

  • I don't like either option, there's no need for a border, and the icon makes no sense.

 

4. We also displayed to the participants different types of labels!

Results:

 

There was no clear preference among the types of labels. The comments indicated that the icons were distracting and that, unless the label is associated with the word/concept, it should be avoided.

Some interesting comments:

Participants that selected: A (round):

  • Least amount of brain strain
  • Labels with tags definitely detract from reading the text - labels appear just too big. Something small but colour coded like the round dot (although small would be nice)

 

Participants that selected: no preference:

  • Again these are all bad, and make no sense in the minimal context I have. An icon of a bike or of a train, etc. would help greatly though.

 

5. We asked: Do you prefer to expand/collapse taxonomy nodes using plus (+) / minus (-) signs (A) or arrows (B)?

 

Results:

 Some interesting comments:

Participants that selected: plus (+) / minus (-) signs (A):

  • Quicker to see whether a list is expanded
  • Because i am able to understand it and that's the standard followed.

 

Participants that selected: arrows (B):

  • Arrows indicate (open/expand) to reveal drop down list. Plus and minus doesn't make sense. These symbols would be used when adding or removing items to a shopping cart or list of favourites etc

 

Participants that selected: no preference:

  • Both have their advantages and disadvantages

 

 

II. Search:  displaying and highlighting matches, functionality… and that search box!

6. We asked:  Assume you searched for "Chevrolet" and multiple taxonomy nodes matched your query. Would you like the system to return your matched query and just the parent nodes (A), just the query and the children nodes (B), or the query with both parent and children nodes (C)?   [This will help you select the node that interests you the most.]

Results:


Some interesting comments (from participants that selected (C)):

  • Depends on the amount of results - this may be too much information
  • Good to give user all options when searching.
  • The more data the better, always. As long as it is clear, concise, and not redundant.
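
Option (C) can be sketched in a few lines of Python: for every node whose label matches the query, return the chain of parents above it and the children directly below it. The nested-dictionary representation, the function name and the sample taxonomy are assumptions made for this illustration, and the match is a simple case-insensitive comparison.

```python
def find_with_context(tree, query, path=()):
    """Return every node whose label equals the query (case-insensitive),
    together with its ancestors (top-down) and its direct children."""
    hits = []
    for label, children in tree.items():
        if label.lower() == query.lower():
            hits.append({
                "parents": list(path),
                "match": label,
                "children": sorted(children),
            })
        # Keep searching below this node, matched or not.
        hits.extend(find_with_context(children, query, path + (label,)))
    return hits


# Hypothetical taxonomy in which "Chevrolet" appears in two places.
taxonomy = {
    "Cars": {"Chevrolet": {"Impala LS": {}}},
    "Manufacturers": {"USA": {"Chevrolet": {}}},
}
for hit in find_with_context(taxonomy, "chevrolet"):
    print(" > ".join(hit["parents"] + [hit["match"]]), hit["children"])
```

Showing the parents is what lets a user tell the two "Chevrolet" nodes apart; showing the children hints at what they will find after selecting one.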

 

7. We asked: When you search a taxonomy for a term, do you want just exact matches returned, or are you interested in partial matches and hidden matches too? Please SELECT ALL that apply!

Results:


Some interesting comments:

  • I'd have to use it to know whether showing C would be useful or not. It may confuse the user (unless the hidden labels are displayed somehow too)
  • Hidden match is outside what I expect to happen - it does not match my mental model. I can find Chevrolet and drill down to find Impala LS. It would be nice to have a soundex or fuzzy search in case I misspell Chevrolet
  • Personally I like the whole package, but for some people it would be better to have just exact ones or partial, for they will decide to browse or not based on a limited choices rather than with tones of choices but not know what to do.
  • Everything should be found top level. However I do think you could include a refined search to give the user the option of how they want to search. www.autotrader.co.uk do this very well.
  • Should show all unless modifier used in search.

 

8. We asked: Would you like exact, partial and hidden matches to be highlighted in a different manner in the search output?

Results:


Some interesting comments:

Participants that selected: all the same (A):

  • All the same, introducing new colour schemes could only add confusion, the content is still relevant to the category.
  • Too many colors get confusing.

 

Participants that selected: exact vs. the rest (B):

  • Simple is much better.
  • I would just deem partial, hidden and other match types as 'alternative matches'. I don't think partial or hidden matches means anything to the end-user because it is technical jargon that the creators of the tool use.

 

Participants that selected: no preference:

  • Only if there was a key or training that made this obvious to me. Why not an option that highlights just the search terms like every other search.

9. Many have asked this… We dared to ask too: Which search box do you prefer: A, B, C or D? Do you like the use of shadow in the search box (1) or prefer it without shadow (2)?

Results:


Some interesting comments:

  • I didn't even notice there was a difference between 1 and 2 until you pointed it out. It probably depends on use, if it's on a small screen, no shadow would probably heighten contrast and increase usability. On big screens like the one I'm using, obviously I didn't even notice. [Participant selected (A) and no preference on shadow]
  • My preferences here are pretty typical of current trends. Simplicity is king at the moment. [Participant selected (B) and no shadow]
  • The shadow part is entirely dependent on the style of the rest of the UI. [Participant selected (A) and with shadow]

 
