Archive for the 'Semantic Web' Category

Recontextualization of metadata – part 3: semantic web

What role do metadata play in the semantic web? The semantic web aims at structuring the contents of the web in a way that makes it possible for machines to find and combine information. At present, there are two approaches to this goal, one bottom-up and one top-down. In short, the bottom-up approach depends on content providers adding metadata to their web pages, while in the top-down approach a third party crawls the web or a selection of sites and automatically generates a set of metadata for certain pages.

Today, a number of semantic web applications give a glimpse of what can be done with highly structured data. It is interesting to note that all examples come from projects embracing the top-down approach, not from the bottom-up approach, which requires the systematic assignment of metadata. I’d like to take a closer look at Freebase Parallax. The power of Parallax is to aggregate information from different sources and make it visible on one screen – in a parallel way (hence the name, I assume) instead of the usual sequential succession of sources. The data Parallax uses come mainly from Wikipedia, so don’t expect any groundbreaking new facts. But it’s still quite an awesome application, at least when you stick to the examples used in the demo video. (When experimenting yourself, it is rather difficult to find good examples, but there are a few more on Infobib (in German).)

The first example from the demo video is a search for Abraham Lincoln’s children; it shows how Parallax can arrange all four entries on one page instead of the user having to navigate to each individual page.

Abraham Lincoln's Children on Parallax

In the next step, the narrator of the video expands his search to the children of all American presidents, showing Robert Todd Lincoln alongside the children of George Bush Sr. and Jr., and then goes on to find all the schools American presidents’ children attended.

Lincoln's son alongside the Bushes' children

What has happened is that the Wikipedia entry on Abraham Lincoln was stripped of its context and reduced to two relationships: a) Abraham Lincoln was a president of the US, and b) Abraham Lincoln had children. When the same is done for the rest of Wikipedia, all entries with the attributes «American president» and «has children» can be combined. As in a database, you can match all X which have an attribute Y (e.g. a school) – without even having to give this attribute Y a value (e.g. the name of a certain school).
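
To make the mechanics concrete, here is a minimal sketch in Python of this kind of attribute matching. The triples and the helper function are invented for illustration; the real data would of course come from Freebase’s much larger graph.

    # Entities reduced to subject-predicate-object relationships,
    # as described above. This mini-dataset is a toy stand-in.
    triples = [
        ("Abraham Lincoln", "office", "US President"),
        ("Abraham Lincoln", "child", "Robert Todd Lincoln"),
        ("George H. W. Bush", "office", "US President"),
        ("George H. W. Bush", "child", "George W. Bush"),
        ("Robert Todd Lincoln", "school", "Harvard University"),
    ]

    def objects(subject, predicate):
        # All values of `predicate` recorded for `subject`.
        return [o for s, p, o in triples if s == subject and p == predicate]

    # All US presidents, then all their children -- note that we only ask
    # whether the attribute «child» exists, without naming any particular child.
    presidents = [s for s, p, o in triples if p == "office" and o == "US President"]
    children = [c for pres in presidents for c in objects(pres, "child")]
    schools = [sch for c in children for sch in objects(c, "school")]

    print(children)  # ['Robert Todd Lincoln', 'George W. Bush']
    print(schools)   # ['Harvard University']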

The examples show that this de-contextualization (the reduction of Robert Todd Lincoln to «child of US president») has great potential for creating new associations of information. A number of Firefox extensions make use of this and combine information about books, films etc. with people who have visited the corresponding web sites (Glue), or link terms to other sources according to their classification as person, company, country etc. (Gnosis). The re-contextualization, however, is tricky. If too much context is suppressed, the results become ridiculous, as with the link from Abraham Lincoln (correctly identified as a person) to LinkedIn.

Gnosis' links for Abraham Lincoln

Also, the ability of machines to categorize information still seems to be very limited. Glue (called «BlueOrganizer» when I took the screenshot), for instance, can only recognize books on websites such as Amazon or Barnes & Noble, but not on the Library of Congress’ site. And to a human user, the mere recognition of a category comes across as rather silly.

Blue Organizer recognizes that you're looking at a book!

This shortcoming cannot be attributed to the absence of metadata. Zotero (mentioned earlier on this blog), for instance, manages to identify books on many more sites, those of libraries as well as online bookshops. It just shows how much effort is necessary to recognize structure on the web as such. This is exactly what the top-down approach achieves: it identifies concrete questions (e.g. who else read this book?) as well as sources which might be relevant to answering them (e.g. sites which link a book to the people who have read it) and then creates new relationships between these data. This is a much more selective approach than the bottom-up one, and I believe it is the more promising. The bottom-up approach either doesn’t go beyond today’s databases or it aims at the maximum number of options for future relationships. It is too unspecific to be attractive, and there are too many uncertainties about its usefulness. This, I believe, is the main reason why it isn’t being adopted. For reasons of semantics, I also doubt it will be possible to achieve useful results by automatically mapping data from heterogeneous sources via generic ontologies in the near future (see the above example of Abraham Lincoln on LinkedIn).
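
A toy sketch of that top-down pattern, with invented data: one concrete question («who else read this book?»), one source that can answer it, and a join between the two via a shared identifier (here an ISBN).

    # Two independent sources, joined by a common identifier.
    # Both datasets are invented for illustration.
    catalogue = {"9780316769488": "The Catcher in the Rye"}
    readers = {"9780316769488": ["alice", "bob"]}  # e.g. from a social reading site

    # The newly created relationship: book title -> people who have read it.
    for isbn, title in catalogue.items():
        print(title, "->", readers.get(isbn, []))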

Maybe the bottom-up approach should be viewed without rigid technical requirements, but more openly, in the sense of carefully produced content which

  • embeds information in its context (e.g. origin, use, related information) and/or
  • structures content within a certain scope by using (more or less) standardized elements.
Standardized display of a city on Wikipedia

This leads me back to my initial question: are metadata good for findability? The semantic web, at least at present, is another example that adding descriptors is neither popular with content producers nor seems to be considered crucial for automated processing. But metadata aren’t restricted to descriptors which are added to content; the features mentioned above are pure metadata and absolutely essential for creating structured search. And structured search, or variations of it like faceted navigation or the semantic web applications mentioned above, has a great advantage over pattern-matching search engines: instead of only being able to look for X, it is possible to look for something you do not exactly know, except that it should have the characteristic Y.
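
A small illustration of that advantage, again with invented records: faceted navigation groups a collection by a metadata field and lets the user browse by a characteristic Y (here a language facet) instead of searching for a known X.

    from collections import Counter

    # Invented records carrying structured metadata.
    books = [
        {"title": "Book A", "language": "German", "subject": "history"},
        {"title": "Book B", "language": "English", "subject": "history"},
        {"title": "Book C", "language": "German", "subject": "art"},
    ]

    # One facet: how many items carry each value of «language».
    print(Counter(b["language"] for b in books))
    # Counter({'German': 2, 'English': 1})

    # Filtering by a facet value, i.e. «show me everything in German».
    print([b["title"] for b in books if b["language"] == "German"])
    # ['Book A', 'Book C']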

Carefully structured, metadata-rich information may not be crucial for search engines which rely on pattern-matching. But it opens new perspectives for exploring related content, for finding answers without exactly knowing the question.

Introductions to the Semantic Web

The idea of the Semantic Web has been around for a long time. Tim Berners-Lee articulated it in his plenary talk at the first W3 conference in 1994 and published a famous article in Scientific American in 2001. Basically, the idea of the Semantic Web is to give data a format in which computers can process their meaning. However, the idea hasn’t really picked up speed in all those years. This has changed recently, and a series of articles (incidentally published on netzwertig, a blog produced by Zeix’ sister company Blogwerk) explains the nuts and bolts of the Semantic Web in plain German. Part 1 gives a general introduction (Semantisches Web Teil 1: Was steckt hinter dem Begriff? – «What’s behind the term?»), part 2 explains the technical background (Die technische Umsetzung – «The technical implementation») and part 3 gives practical examples (konkrete Anwendungsbeispiele – «concrete examples of use»).
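
To show what «a format in which computers can process meaning» looks like in practice, here is a minimal sketch using Python and the rdflib library (assumed to be installed); the tiny Turtle document is invented for illustration.

    from rdflib import Graph

    # A miniature RDF document in Turtle syntax: the statements are
    # explicit subject-predicate-object triples.
    turtle = """
    @prefix ex: <http://example.org/> .
    ex:AbrahamLincoln ex:office ex:USPresident ;
                      ex:child  ex:RobertToddLincoln .
    """

    g = Graph()
    g.parse(data=turtle, format="turtle")

    # Any RDF-aware program can now read -- and combine -- these facts.
    for subject, predicate, obj in g:
        print(subject, predicate, obj)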

I’ve been looking for English equivalents to these articles. The closest I’ve come across so far is Read Write Web’s The Road to the Semantic Web, which explains why the Semantic Web could be important to us («The promise is that we will be doing less of what we are doing now – namely sifting through piles of irrelevant information») besides giving a short introduction to the data formats used for computer processing. Two later articles on the same blog go into more detail on why the implementation of the Semantic Web is proving so difficult: not only is the technical background hard to understand (Difficulties with the Classic Approach), but transforming the data into a computer-readable format is a lot of work, and so far the market has not rewarded those who take the pains to do it (Top-Down: A New Approach to the Semantic Web).

The many comments on «Difficulties with the Classic Approach» show that most people attribute the lack of success of the Semantic Web to the difficulties of its practical implementation: «It’s way too technical and scientific and not really practical for the mere mortal». Despite its capital S, the subject of semantics plays a minor role in the discussion. And here, in my opinion, lies the crux of the matter: for highly structured and standardized data such as addresses, people, books etc., the corresponding metadata are simple to generate from a semantic point of view, and accordingly, these are the areas in which commercial tools are evolving. For the rest of the contents of the web, the idea of mapping ontologies against each other is daunting at best. Comment no. 17 on the post mentioned above gives some practical examples of the difficulties encountered even with structured data. And another trend on the web, tagging, takes exactly this fuzziness of meaning into account, particularly stressing the importance of connotations, i.e. emotional associations with words, for human beings. I hope to deal with that subject on this blog soon.

Metadata revival?

Metadata are a big thing with archivists and other people concerned with context, but I must admit that in all my professional years, I have never worked on a web project which actually used the Dublin Core metadata set. The most probable reason that people don’t seem to bother much about metadata – at least in a standardized form – is that popular search engines don’t seem to take them into account. Or at least they didn’t until recently.
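
As a reminder of what Dublin Core looks like when it is used, here is a small sketch: an invented page head carrying DC elements as meta tags, read with Python and BeautifulSoup (assumed to be installed).

    from bs4 import BeautifulSoup

    # An invented page head with Dublin Core elements embedded as meta tags.
    html = """
    <head>
      <meta name="DC.title"   content="Metadata revival?">
      <meta name="DC.creator" content="Example Author">
      <meta name="DC.date"    content="2008-11-01">
    </head>
    """

    soup = BeautifulSoup(html, "html.parser")
    dc = {tag["name"]: tag["content"]
          for tag in soup.find_all("meta")
          if tag.get("name", "").startswith("DC.")}
    print(dc)  # {'DC.title': 'Metadata revival?', 'DC.creator': ..., 'DC.date': ...}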

Let’s have a look at a (very!) brief history of search result designs:

Search engines I remember from my early web experience returned results looking somewhat like this (our government portal hasn’t made much progress since):

Search result for passport on www.ch.ch

Then Google came along and introduced a design which has since become pretty much the standard:

Google search results

More recently, both Google and Yahoo have started introducing structured search results:

Yahoo Search Gallery, Country Profile Armenia

Search results are also increasingly shown in clusters based mainly on format (general text, news, encyclopedia entries, images, video etc.):

Yahoo India search results

So adding metadata in some kind of standardized form does seem to be a recent trend for

  • clustering search results and
  • displaying search results.

Metadata provided by the creators of web sites are used for these displays. However, these metadata are explicitly not used for search algorithms, as an article on Yahoo and the Future of Search reports. Metadata provided by the creators tend to bias the outcome, and the analysis of a broader text corpus by powerful search engines provides more significant results than metadata out of context.

Still, the increased use of metadata points in interesting directions:

  • Search results are becoming more context-sensitive. Metadata help the user to choose the appropriate context, e.g. for disambiguation or clarification of a query. Search interfaces are taking the iterative nature of search into account and getting closer to the process of questions and answers users require to clarify their needs.
  • Possible actions after finding the desired content are beginning to be transferred to the search sites (search engines becoming portals may – or may not – be part of this development). Users can view details, maps or reviews, check opening hours, buy tickets or conduct a site search without having to leave the search results page. This is enabled by a deeper integration of applications into the results.

Google Search for NASA

Site-search from search results page