Archive for the 'Search' Category

Recontextualization of metadata – part 3: semantic web

What role do metadata play in the semantic web? The semantic web aims at structuring the contents of the web in a way that makes it possible for machines to find and combine information. At present, there are two approaches to this goal, one bottom-up and one top-down. In short, the bottom-up approach depends on metadata being added to web pages by content providers, while in the top-down approach a third party crawls the web or a selection of sites and automatically generates a set of metadata for certain pages.

Today, a number of semantic web applications give a glimpse of what can be done with highly structured data. It is interesting to note that all examples come from projects embracing the top-down approach, not from the bottom-up approach, which asks for systematic assignment of metadata. I’d like to take a closer look at Freebase Parallax. The power of Parallax is to aggregate information from different sources and make it visible on one screen – in a parallel way (hence the name, I assume) instead of the usual sequential succession of sources. The data Parallax uses come mainly from Wikipedia, so don’t expect any groundbreaking new facts. But it’s still quite an awesome application, at least when you stick to the examples used in the demo video. (When experimenting yourself, it is rather difficult to find good examples, but there are a few more on Infobib (in German).)

The first example from the demo video is a search for Abraham Lincoln’s children and shows how Parallax can arrange all four entries on one page instead of the user having to navigate to each single page.

Abraham Lincoln's Children on Parallax

In a next step, the narrator of the video expands his search to all children of all American presidents, showing Robert Todd Lincoln alongside the children of George Bush Sr. and George W. Bush, and then goes on to find all schools that American presidents’ children attended.

Lincoln’s son alongside the Bushes’ children

What has happened is that the Wikipedia entry on Abraham Lincoln was stripped of its context and reduced to two relationships: a) Abraham Lincoln was a president of the US, and b) Abraham Lincoln had children. When the same is done for the rest of Wikipedia, all entries with the attributes «American president» and «has children» can be combined. As in a database, you can match all X which have an attribute Y (e.g. a school) – without even having to give this attribute Y a value (e.g. the name of a certain school).
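The mechanics behind this kind of query can be sketched in a few lines of Python. The triples below are invented for illustration (real data would come from Freebase’s graph); the point is that you can ask for every subject that *has* a certain attribute without naming its value:

```python
# Entities reduced to subject-predicate-object triples, as described above.
# The data is a toy sample, not an excerpt of Freebase.
triples = [
    ("Abraham Lincoln", "office", "US President"),
    ("Abraham Lincoln", "child", "Robert Todd Lincoln"),
    ("George W. Bush", "office", "US President"),
    ("George W. Bush", "child", "Jenna Bush"),
    ("Robert Todd Lincoln", "school", "Harvard University"),
]

def subjects_with(predicate, triples):
    """All subjects carrying the given attribute, whatever its value."""
    return {s for s, p, o in triples if p == predicate}

presidents = subjects_with("office", triples)

# Pivot, as in the demo video: from the presidents to all their children.
presidents_children = {o for s, p, o in triples
                       if p == "child" and s in presidents}
```

Chaining one more such pivot (from children to their «school» attribute) would reproduce the third step of the demo.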

The example shows that this de-contextualization (the reduction of Robert Todd Lincoln to «child of US president») has great potential for creating new associations of information. A number of Firefox extensions make use of this and combine information about books, films etc. with people who have visited the corresponding web sites (Glue) or link terms to other sources according to their classification as person, company, country etc. (Gnosis). The re-contextualization, however, is tricky. If too much context is suppressed, the results become ridiculous, as with the link from Abraham Lincoln (correctly identified as a person) to LinkedIn.

Gnosis' links for Abraham Lincoln

Also, the ability of machines to categorize information still seems to be very limited. Glue (called «BlueOrganizer» when I took the screenshot), for instance, can only recognize books on websites such as Amazon or Barnes & Noble, but not on the Library of Congress’ site. And to a human user, the mere announcement that a category has been recognized comes across as rather silly.

Blue Organizer recognizes that you're looking at a book!

This lack cannot be attributed to the absence of metadata. Zotero (mentioned earlier on this blog), for instance, manages to identify books on many more sites, library catalogues as well as on-line bookshops. It just shows how much effort is necessary to recognize structure on the web as such. This is exactly what the top-down approach achieves: It identifies concrete questions (e.g. who else read this book?) as well as sources which might be relevant to answer them (e.g. sites which provide links from a book to people who have read it) and then creates new relationships between these data. This is a much more selective approach than the bottom-up one, and I believe it is more promising. The bottom-up approach either doesn’t go beyond today’s databases or it aims at the maximum number of options for future relationships. It is too unspecific to be attractive, and there are too many uncertainties about its usefulness. This, I believe, is the main reason why it isn’t being adopted. For reasons of semantics, I also doubt it will be possible to achieve useful results by automatically mapping data from heterogeneous sources via generic ontologies in the near future (see the above example of Abraham Lincoln on LinkedIn).

Maybe the bottom-up approach should be viewed without rigid technical requirements, but more openly, in the sense of carefully produced content which

  • embeds information in its context (e.g. origin, use, related information) and/or
  • structures content within a certain scope by using (more or less) standardized elements.
Standardized display of a city on Wikipedia

This leads me back to my initial question: Are metadata good for findability? The semantic web, at least at present, is another example that adding descriptors is neither popular with content producers nor seems to be considered crucial for automated processing. But metadata aren’t restricted to descriptors which are added to content; the features mentioned above are pure metadata and absolutely essential for creating structured search. And structured search, or variations of it like faceted navigation or the semantic web applications mentioned above, has a great advantage over pattern-matching search engines. Instead of only being able to look for X, it is possible to look for something you do not exactly know except that it should have the characteristic Y.
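A minimal sketch of how such structured, faceted search works – the book records are invented, and the facet names are placeholders. The point is that you filter on the presence and value of an attribute rather than on words in the text:

```python
from collections import Counter

# Toy records with descriptor-style attributes (invented for illustration).
books = [
    {"title": "A", "subject": "chemistry", "language": "de"},
    {"title": "B", "subject": "history", "language": "de"},
    {"title": "C", "subject": "chemistry", "language": "en"},
]

def facet_counts(records, facet):
    """How many records carry each value of a facet - the numbers
    shown next to each filter in a faceted navigation interface."""
    return Counter(r[facet] for r in records if facet in r)

def narrow(records, **criteria):
    """Keep only records matching every given facet value."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]
```

Here, `facet_counts(books, "subject")` tells you what characteristics Y exist at all, before you ever name a value – exactly the kind of exploration pattern-matching search cannot offer.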

Carefully structured, metadata-rich information may not be crucial for search engines which rely on pattern-matching. But it opens new perspectives for exploring related content, for finding answers without exactly knowing the question.

Recontextualization of metadata – part 2: databases

In a previous post, I asked under which circumstances descriptors can be powerful for findability. The classic example is databases – the attributes of the objects are nothing else but descriptors, and highly standardized ones at that. The database itself delivers the context for the objects it contains: it is made expressly for certain data and for a certain reason, thus providing clear and reliable context.

But if describing objects in a consistent way is such a drag, why would anyone bother to do it? Or to put it the other way round: Is this a way to make structuring information more intuitive, simple and pleasurable to use?

Let’s take a more detailed look at the first part of the question: Why does anyone bother to describe objects in a database? Because objects in databases are similar to each other, and detailed descriptions make them comparable. In a database for flights, e.g. Kayak, you can compare all flights with a common origin and destination, e.g. from Amsterdam to Zurich, or you might want to find out how far 1000 Swiss francs could get you (though I haven’t yet seen this implemented). The incentive for providers of databases or vertical search engines to make the effort of describing objects is, on the one hand, to offer the largest range of data, and on the other hand, to make them as easily comparable as possible to meet consumers’ needs.
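The budget query imagined above becomes trivial once every object carries the same standardized attributes. A sketch with invented flight records (the prices and routes are made up, not Kayak data):

```python
# Records structured like a vertical search engine's database:
# every object carries the same consistently described attributes.
flights = [
    {"origin": "AMS", "destination": "ZRH", "price_chf": 180},
    {"origin": "AMS", "destination": "ZRH", "price_chf": 95},
    {"origin": "AMS", "destination": "JFK", "price_chf": 850},
]

def compare(flights, origin, destination):
    """All flights on a route, cheapest first - the classic Kayak query."""
    return sorted((f for f in flights
                   if f["origin"] == origin and f["destination"] == destination),
                  key=lambda f: f["price_chf"])

def within_budget(flights, origin, budget_chf):
    """Where could 1000 francs get you? Only answerable because the price
    is a consistently described attribute across all objects."""
    return {f["destination"] for f in flights
            if f["origin"] == origin and f["price_chf"] <= budget_chf}
```

Both queries depend entirely on the conventions agreed on beforehand – which is exactly the back-breaking part discussed below.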

Flight Amsterdam - Zurich on kayak.com

Concerning the second part of the question: Do databases make structuring information more intuitive, simple and pleasurable to use? The answer is no. Anyone who has ever been involved in aggregating structured data from different sources will tell you what back-breaking work it is. This even holds true for highly standardized environments such as international aviation. The relative homogeneity of the data described in a database may make the task of consistent description simpler, but the drawback is that databases are rigid. Conventions need to be agreed on to make consistent description possible, and they reduce the objects to the chosen characteristics.

The rigidity of databases means they are not well adapted to change. This is very well argued by Hank Williams in his post The Death of the Relational Database (why an eloquent writer would choose the title Why does everything suck? for his blog is beyond a square like myself, but it’s worth the read):

As long as you don’t want to radically change or expand the scope of what you are doing, relational databases are great. But knowledge is an ever-expanding universe of objects and relationships between them. The relational database doesn’t handle that use case very well.

In my last post as well as in this one, I argued that the consistent use of descriptors is only possible within very narrow limits and that in order to take complexity – and change – into account, more flexible concepts are necessary. Is the Semantic Web an answer to that? As fascinating as the prospect sounds, is it really possible to detach relations from their context? The third and final part of this series will look into this.

Recontextualization of metadata – part 1: keywords & classifications

In the last post, I pondered the question whether descriptors can be powerful for findability. I’ll avoid defining what findability is and discuss a few examples instead.

Example 1: Libraries

Descriptors are heavily used in libraries, but do they improve findability in practice? A search in two different university library catalogues (Basle and Zurich) for the interdisciplinary subject of the history of chemistry in Switzerland returned only two matches, even though each library holds about 30 relevant titles. It is noteworthy that both libraries seem to possess the same books (I did a series of cross-checks) but deliver different search results for the same query.

The search was conducted with the keywords «geschichte schweiz chemie» (history, Switzerland, chemistry). The screenshots of the keywords for an identical book (which these keywords found in only one of the libraries) show why the results are so different:

Keywords for Tobias Straumann’s «Die Schöpfung im Reagenzglas» in Basle’s university library catalogue

Keywords for the same title in Zurich’s university library catalogue

The keywords were helpful for finding relevant works in both libraries, because most of the titles are totally useless for a subject search. Until recently, no Swiss library included any contextual information such as summaries or tables of contents in its search. However, the quality of the search results is dubious. The difficulties are very much the same as for search in unstructured data: The keywords are either too broad or too narrow, the controlled vocabularies are not flexible enough to account for all possible variations, and the human factor leads to inconsistent use of descriptors. Furthermore, it is impossible to reconstruct the reason for the divergence of the results.

Example 2: Classifications of Swiss Legislation

To make periodically published and revised legislation accessible by subject, it is filed in a classification. Even though in Switzerland the federal, cantonal (state) and communal jurisdictions are sovereign, the classifications of legislation are similar in large parts of the country and over all levels of state. The classifications were developed long before the Web and have been adapted since. Primary access is by browsing alphabetically or by subject.

Alphabetical index of the Compilation of Federal Legislation

The alphabetical index is an elaborate system of cross-references. It helps with disambiguation and points to results in less obvious categories.

Classification of the Swiss Confederation (excerpt)

Classification of the canton of Basel-Stadt (corresponding excerpt)

Access by subjects gives the user an idea of the scope of the collection. The hierarchical approach allows him or her to quickly drill down to the relevant subject.

A human user can easily spot the resemblances between similar classifications and thus find the corresponding legislation across different levels of jurisdiction. For automatic aggregation, however, mere similarity is an obstacle. Therefore, the Swiss organization for e-government standards, eCH, has published classifications for e-government subjects (in German).

Rather surprisingly, the standard is not used consistently in the cantons cited as best practice examples (e.g. compare the cantons of Aargau and Basel-Stadt), and the synonyms and related terms recommended for the improvement of findability are not being used.

eCH standard

Version of Canton Aargau

Conclusions

Keywords improve findability when there is no other information available, e.g. in library catalogues without digitized content. However, keywords are only reliable when the mechanism of attributing them is transparent. «Intellectual control» – the term archivists use for creating tools to access their records – is not only important for those who create the tools, but just as well for those who use them in order to gain access. Lack of transparency means loss of control, and even when laid open, systems of cross-reference can be confusing for the user, time-consuming to maintain and error-prone.

The other example, the classification of legislation, shows that transparency can at least partly compensate for lack of control. This should not be an excuse for ignoring existing standards, which are primarily created for the sake of interoperability and to facilitate maintenance. But the examples do show that beyond a certain level of detail, control gets out of hand. The objects described often have too many facets to be consistently described – or to put it the other way round, it is difficult to create comprehensive classification systems. Thus, at a deeper level, classifications can easily shift from an asset to a risk, because results tend to become unreliable. In the digital world, I believe, it makes more sense to abandon descriptors at that level and to fall back on methods of natural language processing.


Metadata for Findability

Does it make sense to add metadata to information for the sake of findability? Historically, it definitely did. Before the age of digitally available text and full text search, there was no other means to provide access to large amounts of information. Librarians filed books under their author, title, keywords etc. Today, descriptors – keywords, subject headings, tags etc. – seem to have difficulties making the transition to the 21st century. Is this justified? Let’s take a look at some examples I noted recently which doubted the effectiveness of descriptors.

Kumarkamal on Basics of Search Engine Optimization warns «Don’t waste your time on meta tags», pointing out that search engines largely ignore descriptors.

Mathew Ingram raises the question «Who bookmarks anymore?» He states that, instead of consulting his bookmarks, «[i]f I’m writing about something and I remember some details, I type them into Google and eventually track the page down.» In a review of the social bookmarking service Qitera, tech blog netzwertig even claims that «[Schlagwörter] sind nämlich im Prinzip nicht mehr als ein Ausgleich für den Mangel an intelligenten Suchmöglichkeiten», i.e. in principle, keywords are nothing but a compensation for the lack of possibilities for intelligent search.

What are the points at issue? Search engines ignore descriptors because of bias. Descriptors added by humans are biased – almost necessarily so: to give the gist of a text, you need to focus on some parts and omit others. In addition to this «immanent» bias, descriptors are often selected to influence search engine ranking. This leads to less reliable search results. Google first became famous because it developed algorithms to counter exactly this problem: Google’s ranking was based on reliability. Links to a page were interpreted as an indication that a site provided valuable and trustworthy information. Google thus introduced the context of a site into its search algorithm. But descriptors, being metadata, i.e. data «outside» the data, lack context – or rather, their context is not taken into account by search engines.

The second example, on the other hand, is that of a highly contextual search. Ingram remembers the item he is looking for («I remember some details»), can formulate a precise query and easily decide which document is the one he has been looking for. With all this information given, typing keywords into Google is a highly efficient search strategy bound to retrieve excellent results – so why bother to reduce the document to keywords? In addition, searching by keywords would most probably return too many documents in this case. The strength of descriptors is recall, not precision. Again, this particularly holds true for descriptors lacking context. Of course, this doesn’t apply to all descriptors, but human language tends to be ambiguous without context.

In order to deliver useful results for search, descriptors need to be re-contextualized. Basically, there are two ways of including context: adding information or reducing information to a well-defined scope. The first is what machines do well. Search engines can analyze vast amounts of information, find patterns, match items etc. Human beings usually prefer the second method: They reduce complexity by resorting to subjects with (more or less) clear outlines, and particularly to reliable environments. If the creator of a record is known and trusted, or if there is sufficient evidence that this is the case (e.g. by recommendation of a trusted person), then a record’s content can be taken to be trustworthy. Reducing the data taken into account for a search to a trusted environment greatly improves precision. The price of diminished recall is negligible in this case. Of course, the bias mentioned previously can play a role, but through knowledge of its context, it can be put into perspective.
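The trade-off can be made concrete with the standard definitions of precision and recall. The corpus below is invented, merely shaped to illustrate the argument that restricting a search to a trusted environment raises precision at little cost in recall:

```python
def precision_recall(retrieved, relevant):
    """Standard definitions: precision = relevant hits / all hits,
    recall = relevant hits / all relevant documents."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented corpus of six documents.
tagged = {"d1", "d2", "d3", "d4"}   # retrieved via a descriptor
relevant = {"d1", "d2"}             # what the user actually wants
trusted = {"d1", "d2", "d5"}        # the trusted environment

p_all, r_all = precision_recall(tagged, relevant)
p_trusted, r_trusted = precision_recall(tagged & trusted, relevant)
```

In this constructed case, restricting the search to the trusted subset lifts precision from 0.5 to 1.0 without losing any relevant document.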

In the following posts, I’d like to look at some examples of the use of descriptors which take context into account, in delimited areas as well as by analysis of additional data. What I won’t be doing – though I did plan this initially – is to compare the effectiveness of full text search and search in metadata. I did find a few indications in library science literature – none of them conclusive – and I believe now that the comparison is not a legitimate one. Too much depends on what a user is looking for, which data he searches in, and how familiar he is with particular sources. So the question is simply whether descriptors can be powerful for findability, and under which circumstances.

Search Engine Tips

Hakia, a semantic search engine, uses Yahoo’s BOSS and does a good job at neatly clustering results (e.g. try a search for a country) and also has a reasonable ranking – something many new search engines don’t manage well. By contrast, the much-hyped Powerset, another recent semantic search engine, is simply disappointing.

Cluuz aims at graphically clustering search results. So far, I have been sceptical towards visualizations of search results, because the mind-map kind of graphs (e.g. in Thinkmap or Quintura) are either rather trivial or useless. But Cluuz (or, similarly, KartOO) clusters sources rather than semantic characteristics. Not that this is entirely new – Clusty (previously Vivisimo; I’ve used the example in the post on information literacy vs usability) does it in the form of a regular list – but the focus on sources is one I much appreciate, and showing relationships between sources might prove promising.

A last tip for today: searchme‘s sequential presentation of search results is too playful for my liking, since I need a quick overview to choose what I want to take a closer look at. However, a really nice feature is the integration of stacks, which allows you to create visual bookmarks – something completely different, because it is not about seeking new objects but rather showing known ones (cf. Theresa Neil’s extensive post on seek or show).

Metadata revival?

Metadata is a big thing with archivists and other people concerned with context, but I must admit that in all my professional years, I have never worked on a web project which actually used the Dublin Core Metadata set. The most probable reason that people don’t seem to bother much about metadata – at least in a standardized form – is that popular search engines don’t seem to take them into account. Or at least they didn’t until recently.
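For illustration, this is what Dublin Core descriptors look like in a page’s head, and how a crawler could read them back out. The sample page and its field values are invented; only the `DC.`-prefixed meta-tag convention is real:

```python
from html.parser import HTMLParser

# An invented page embedding Dublin Core elements as <meta> tags.
PAGE = """
<html><head>
<meta name="DC.title" content="Metadata revival?">
<meta name="DC.creator" content="Example Author">
<meta name="DC.subject" content="metadata; search engines">
</head><body>...</body></html>
"""

class DublinCoreParser(HTMLParser):
    """Collect all DC.* meta tags from a page's head."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name = a.get("name", "")
        if name.startswith("DC."):
            # Strip the prefix: "DC.title" -> "title".
            self.fields[name[3:]] = a.get("content", "")

parser = DublinCoreParser()
parser.feed(PAGE)
```

Reading the tags back out is the easy part, of course – the history below suggests that getting content providers to write them in the first place is where the approach stalls.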

Let’s have a look at a (very!) brief history of search result designs:

Search engines I remember from my early web experience returned results looking somewhat like this (our government portal hasn’t made much progress since):

Search result for passport on www.ch.ch

Then up comes Google and introduces a design which has become pretty much standard:

Google search for «…»

More recently, both Google and Yahoo have started introducing structured search results:

Yahoo Search Gallery, Country Profile Armenia

Search results are also increasingly shown in clusters based mainly on format (text in general, news, entries from encyclopedias; images, video etc.):

Yahoo India search for «…»

So adding metadata in some kind of standardized form does seem to be a recent trend for

  • clustering search results and
  • displaying search results.

Metadata provided by the creators of web sites are used for these displays. However, these metadata are explicitly not used for search algorithms, as an article on Yahoo and the Future of Search reports. Metadata provided by the creators tend to bias the outcome, and the analysis of a broader text corpus by powerful search engines provides more significant results than metadata out of context.

Still, the increased use of metadata points in interesting directions:

  • Search results are becoming more context-sensitive. Metadata help the user to choose the appropriate context, e.g. for disambiguation or clarification of a query. Search interfaces are taking the iterative nature of search into account and getting closer to the process of questions and answers users require to clarify their needs.
  • Possible actions after having found the desired content are beginning to be transferred to the search sites (search engines becoming portals may – or may not – be part of the development). Users can view details, maps or reviews, check opening hours, buy tickets or conduct site-search without having to leave the search results page. This is enabled by a deeper integration of applications into results.

Google Search for NASA

Site-search from search results page