I admit it took me over 7 years to notice this, but strictly speaking, the metaphor of the breadcrumb trail is totally absurd. Breadcrumb navigation is supposed to lead the user back along the way she or he came. But – remember the fairy tale – the trail of breadcrumbs did NOT help Hänsel and Gretel find their way home, because the birds had eaten all the crumbs. Their earlier strategy of scattering pebbles, however, fulfilled the task very adequately.
What role do metadata play in the semantic web? The latter aims at structuring the contents of the web so that machines can find and combine information. At present, there are two approaches to this goal, one bottom-up and one top-down. In short, the bottom-up approach depends on content providers adding metadata to their web pages, while in the top-down approach a third party crawls the web or a selection of sites and automatically generates a set of metadata for certain pages.
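To make the bottom-up approach concrete: the content provider embeds descriptors directly in the page, and a machine reads them out. A minimal sketch in Python, using only the standard library; the page and its metadata fields are invented for illustration:

```python
from html.parser import HTMLParser

# A hypothetical page with embedded (bottom-up) metadata.
page = """<html><head>
<meta name="keywords" content="history, chemistry, switzerland">
<meta name="description" content="A short study.">
</head><body>...</body></html>"""

class MetaReader(HTMLParser):
    """Collect name/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if "name" in a and "content" in a:
                self.meta[a["name"]] = a["content"]

reader = MetaReader()
reader.feed(page)
print(reader.meta["keywords"])  # the descriptors the provider supplied
```

The top-down approach, by contrast, would have to infer such descriptors from the body text itself, without any cooperation from the provider.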
Today, a number of semantic web applications give a glimpse of what can be done with highly structured data. It is interesting to note that all examples come from projects embracing the top-down approach, not from the bottom-up approach with its systematic assignment of metadata. I’d like to take a closer look at Freebase Parallax. The power of Parallax is to aggregate information from different sources and make it visible on one screen – in a parallel way (hence the name, I assume) instead of the usual sequential succession of sources. The data Parallax uses come mainly from Wikipedia, so don’t expect any groundbreaking new facts. But it’s still quite an awesome application, at least when you stick to the examples used in the demo video. (When experimenting yourself, it is rather difficult to find good examples, but there are a few more on Infobib (in German).)
The first example from the demo video is a search for Abraham Lincoln’s children and shows how Parallax can arrange all four entries on one page instead of the user having to navigate to each single page.
In a next step, the narrator of the video expands his search to the children of all American presidents, showing Robert Todd Lincoln alongside the children of George Bush, Sr. and Jr., and then goes on to find all schools American presidents’ children attended.
What has happened is that the Wikipedia entry on Abraham Lincoln was stripped of its context and reduced to two relationships: a) Abraham Lincoln was a president of the US, and b) Abraham Lincoln had children. When the same is done for the rest of Wikipedia, all entries with the attributes «American president» and «has children» can be combined. As in a database, you can match all X which have an attribute Y (e.g. a school) – without even having to give this attribute Y a value (e.g. the name of a certain school).
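The kind of query described above can be sketched with a few hypothetical, heavily simplified records; the field names and the Churchill entry are illustrative, not taken from Freebase:

```python
# Hypothetical records, as a top-down crawler might extract them
# from encyclopedia entries.
entries = [
    {"name": "Abraham Lincoln", "office": "US president",
     "children": ["Robert Todd Lincoln", "Edward Baker Lincoln",
                  "William Wallace Lincoln", "Thomas Lincoln"]},
    {"name": "George H. W. Bush", "office": "US president",
     "children": ["George W. Bush", "Jeb Bush"]},
    {"name": "Winston Churchill", "office": "UK prime minister",
     "children": ["Randolph Churchill"]},
]

# Match all X with office «US president» that have the attribute
# «children» – without naming any particular child (no value for Y).
presidents_children = [
    child
    for entry in entries
    if entry["office"] == "US president" and entry.get("children")
    for child in entry["children"]
]
print(presidents_children)
```

Note that the query never mentions Robert Todd Lincoln; he appears in the result purely through the two relationships the entries were reduced to.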
The example shows that the de-contextualization (or reduction of Robert Todd Lincoln to «child of US president») has great potential in terms of creating new associations of information. A number of Firefox extensions make use of this, combining information about books, films etc. with people who have visited the corresponding web sites (Glue) or linking terms to other sources according to their classification as person, company, country etc. (Gnosis). The re-contextualization, however, is tricky. If too much context is suppressed, the results become ridiculous, such as the link from Abraham Lincoln (correctly identified as a person) to LinkedIn.
Also, the ability of machines to categorize information still seems to be very limited. Glue (called «BlueOrganizer» when I took the screenshot), for instance, can only recognize books on websites such as Amazon or Barnes & Noble, but not on the Library of Congress’ site. And the mere feat of recognizing a category comes across to the human user as rather silly.
This lack cannot be attributed to the absence of metadata. Zotero (mentioned earlier on this blog), for instance, manages to identify books on many more sites, those of libraries as well as online bookshops. It just shows how much effort is necessary to recognize structure on the web as such. This is exactly what the top-down approach achieves: it identifies concrete questions (e.g. who else read this book?) as well as sources which might be relevant to answering them (e.g. sites which link a book to people who have read it) and then creates new relationships between these data. This is a much more selective approach than the bottom-up one, and I believe it is more promising. The bottom-up approach either doesn’t go beyond today’s databases or it aims at the maximum number of options for future relationships. It is too unspecific to be attractive, and there are too many uncertainties about its usefulness. This, I believe, is the main reason why it isn’t being adopted. For reasons of semantics, I also doubt it will be possible to achieve useful results by automatically mapping data from heterogeneous sources via generic ontologies in the near future (see the above example of Abraham Lincoln on LinkedIn).
Maybe the bottom-up approach should be viewed without rigid technical requirements, but more openly, in the sense of carefully produced content which
- embeds information in its context (e.g. origin, use, related information) and/or
- structures content within a certain scope by using (more or less) standardized elements.
This leads me back to my initial question: are metadata good for findability? The semantic web, at least at present, is another example that adding descriptors is neither popular with content producers nor seems to be considered crucial for automated processing. But metadata aren’t restricted to descriptors which are added to content; the features mentioned above are pure metadata and absolutely essential for creating structured search. And structured search, or variations of it like faceted navigation or the semantic web applications mentioned above, has a great advantage over pattern-matching search engines. Instead of only being able to look for X, it is possible to look for something you do not exactly know, except that it should have the characteristic Y.
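Faceted navigation of the kind mentioned here boils down to counting how many records carry each value of a metadata field, so the user can narrow down by characteristic Y without knowing X. A minimal sketch with made-up catalogue records:

```python
from collections import Counter

# Hypothetical catalogue records; the field names are invented.
records = [
    {"title": "A", "subject": "chemistry", "language": "de"},
    {"title": "B", "subject": "history", "language": "en"},
    {"title": "C", "subject": "chemistry", "language": "en"},
]

def facet_counts(records, field):
    """Count how many records carry each value of a metadata field."""
    return Counter(r[field] for r in records if field in r)

print(facet_counts(records, "subject"))   # chemistry: 2, history: 1
print(facet_counts(records, "language"))  # de: 1, en: 2
```

Each facet value with its count becomes a clickable filter; this only works because the records share structured fields, i.e. metadata.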
Carefully structured, metadata-rich information may not be crucial for search engines which rely on pattern-matching. But it opens new perspectives for exploring related content, for finding answers without exactly knowing the question.
In a previous post, I asked under which circumstances descriptors can be powerful for findability. The classic example is the database – the attributes of the objects are nothing but descriptors, and highly standardized ones at that. The database itself delivers the context for the objects it contains: it is made expressly for certain data and for a certain purpose, thus providing clear and reliable context.
But if describing objects in a consistent way is such a drag, why would anyone bother to do it? Or to put it the other way round: is there a way to make structuring information more intuitive, simple and pleasurable?
Let’s take a more detailed look at the first part of the question: why does anyone bother to describe objects in a database? Because objects in databases are similar to each other, and detailed descriptions make them comparable. In a database for flights, e.g. Kayak, you can compare all flights with a common origin and destination, e.g. from Amsterdam to Zurich, or you might want to find out how far 1000 Swiss francs could get you (though I haven’t yet seen this implemented). The incentive for providers of databases or vertical search engines to make the effort of describing objects is, on the one hand, to offer the largest range of data and, on the other hand, to make the data as easily comparable as possible to meet consumers’ needs.
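Both queries mentioned above – the classic origin/destination search and the speculative «how far can 1000 francs get me» search – rely on the same standardized attributes. A sketch with hypothetical flight records (airport codes and prices are invented):

```python
# Hypothetical flight records with standardized attributes.
flights = [
    {"origin": "AMS", "destination": "ZRH", "price_chf": 180},
    {"origin": "AMS", "destination": "JFK", "price_chf": 950},
    {"origin": "ZRH", "destination": "NRT", "price_chf": 1200},
]

# Classic query: fixed origin and destination.
ams_zrh = [f for f in flights
           if f["origin"] == "AMS" and f["destination"] == "ZRH"]

# The inverse query the text imagines: where can 1000 francs get me?
within_budget = [f["destination"] for f in flights
                 if f["origin"] == "AMS" and f["price_chf"] <= 1000]
print(within_budget)
```

Once the attributes are consistently described, the second query is no harder than the first – the effort lies entirely in producing the consistent descriptions.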
Concerning the second part of the question: do databases make structuring information more intuitive, simple and pleasurable? The answer is no. Anyone who has ever been involved in aggregating structured data from different sources will tell you what back-breaking work it is. This even holds true for highly standardized environments such as international aviation. The relative homogeneity of the data described in a database may make the task of consistent description simpler, but the drawback is that databases are rigid. Conventions need to be agreed on to make consistent description possible, and they reduce the objects to the chosen characteristics.
The rigidity of databases means they are not well adapted to change. This is very well argued by Hank Williams in his post The Death of the Relational Database (why an eloquent writer would choose the title Why does everything suck? for his blog is beyond a square like myself, but it’s worth the read):
As long as you don’t want to radically change or expand the scope of what you are doing, relational databases are great. But knowledge is an ever-expanding universe of objects and relationships between them. The relational database doesn’t handle that use case very well.
In my last post as well as in this one, I argued that the consistent use of descriptors is only possible within very narrow limits and that in order to take complexity – and change – into account, more flexible concepts are necessary. Is the Semantic Web an answer to that? As fascinating as the prospect sounds, is it really possible to detach relations from their context? The third and final part of this series shall look into this.
In the last post, I pondered the question whether descriptors can be powerful for findability. I’ll avoid defining what findability is and discuss a few examples instead.
Example 1: Libraries
Descriptors are heavily used in libraries, but do they improve findability in practice? A search in two different university library catalogues (Basle and Zurich) for the interdisciplinary subject of the history of chemistry in Switzerland returned only two overlapping matches among about 30 titles in each library. It is noteworthy that both libraries seem to possess the same books (I did a series of cross-checks) but deliver different search results for the same query.
In both catalogues, the search was conducted with the keywords «geschichte schweiz chemie» (history, Switzerland, chemistry). The screenshots of the keywords for an identical book (which I found with these keywords in only one of the libraries) show why the results are so different:
The keywords were helpful for finding relevant works in both libraries, because the majority of titles are totally useless for a subject search. Until recently, no Swiss library included any contextual information such as summaries or tables of contents in its search. However, the quality of the search results is dubious. The difficulties are very much the same as for search in unstructured data: the keywords are either too broad or too narrow, the controlled vocabularies are not flexible enough to account for all possible variations, and the human factor leads to inconsistent use of descriptors. Furthermore, it is impossible to reconstruct the reason for the divergence of the results.
Example 2: Classifications of Swiss Legislation
To make periodically published and revised legislation accessible by subject, it is filed in a classification. Even though in Switzerland the federal, cantonal (state) and communal jurisdictions are sovereign, the classifications of legislation are similar in large parts of the country and across all levels of government. The classifications were developed long before the Web and have been adapted since. Primary access is by browsing alphabetically or by subject.
The alphabetical index is a highly elaborate system of cross-references. It helps with disambiguation and points to results in less obvious categories.
Access by subject gives the user an idea of the scope of the collection. The hierarchical approach allows him or her to quickly drill down to the relevant subject.
To find the corresponding legislation across different levels of jurisdiction, a human user can easily spot the resemblances between similar classifications. For automatic aggregation, however, this is an obstacle. Therefore, the Swiss organization for e-government standards, e.ch, has published classifications for e-government subjects (in German).
Rather surprisingly, the standard is not used consistently in the cantons cited as best practice examples (e.g. compare the cantons of Aargau and Basel-Stadt), and the synonyms and related terms recommended for the improvement of findability are not being used.
Keywords improve findability when there is no other information available, e.g. in library catalogues without digitized content. However, keywords are only reliable when the mechanism of attributing them is transparent. «Intellectual control» – the term archivists use for creating tools to access their records – is not only important for those who create the tools, but just as well for those who use them in order to gain access. Lack of transparency means loss of control, and even when laid open, systems of cross-reference can be confusing for the user, time-consuming to maintain and error-prone.
The other example of the classification of legislation shows that transparency can at least partly compensate for lack of control. This should not be an excuse for ignoring existing standards which are primarily created for the sake of interoperability and to facilitate maintenance. But the examples do show that from a certain level of detail, control gets out of hand. The objects described often have too many facets to be consistently described – or to put it the other way round, it is difficult to create comprehensive classification systems. Thus, at a deeper level, classifications can easily shift from an asset to a risk, because results tend to become unreliable. In the digital world, I believe, it makes more sense to abandon descriptors at that level and to retreat to methods of natural language processing.
Does it make sense to add metadata to information for the sake of findability? Historically, it definitely did. Before the age of digitally available text and full text search, there was no other means to provide access to large amounts of information. Librarians filed books under their author, title, keywords etc. Today, descriptors – keywords, subject headings, tags etc. – seem to have difficulties making the transition to the 21st century. Is this justified? Let’s take a look at some examples I noted recently which doubted the effectiveness of descriptors.
Kumarkamal on Basics of Search Engine Optimization warns «Don’t waste your time on meta tags», pointing out that search engines largely ignore descriptors.
Mathew Ingram raises the question «Who bookmarks anymore?» He states that, instead of consulting his bookmarks, «[i]f I’m writing about something and I remember some details, I type them into Google and eventually track the page down.» In a review of the social bookmarking service Qitera, tech blog netzwertig even claims that «[Schlagwörter] sind nämlich im Prinzip nicht mehr als ein Ausgleich für den Mangel an intelligenten Suchmöglichkeiten», i.e. in principle, keywords are nothing but a compensation for the lack of possibilities for intelligent search.
What are the points at issue? Search engines ignore descriptors because of bias. Descriptors added by humans are biased – almost necessarily so. To give the gist of a text, you need to focus on some parts and omit others. In addition to this «immanent» bias, descriptors are often selected to influence search engine rankings, which leads to less reliable search results. Google first became famous because it developed algorithms to counter exactly this problem: Google’s ranking was based on reliability. Links to a page were interpreted as an indication that a site provided valuable and trustworthy information. Google thus introduced the context of a site into its search algorithm. But descriptors, being metadata, i.e. data «outside» the data, lack context – or rather, their context is not taken into account by search engines.
The second example, on the other hand, is that of a highly contextual search. Ingram remembers the item he is looking for («I remember some details»), can formulate a precise query and easily decide which document is the one he has been looking for. With all this information given, typing keywords into Google is a highly efficient search strategy bound to retrieve excellent results – why bother to reduce the document to keywords? In addition, searching by descriptors would most probably return too many documents in this case. The strength of descriptors is recall, not precision. Again, this particularly holds true for descriptors lacking context. Of course, this doesn’t apply to all descriptors, but human language tends to be ambiguous without context.
In order to deliver useful results for search, descriptors need to be re-contextualized. Basically, there are two ways of including context: adding information, or reducing information to a well-defined scope. The first is what machines do well. Search engines can analyze vast amounts of information, find patterns, match items etc. Human beings usually prefer the second method: they reduce complexity by resorting to subjects with (more or less) clear outlines, and particularly to reliable environments. If the creator of a record is known and trusted, or if there is sufficient evidence that this is the case (e.g. a recommendation by a trusted person), then a record’s content can be taken to be trustworthy. Reducing the data taken into account for a search to a trusted environment greatly improves precision. The price of diminished recall is negligible in this case. Of course, the bias mentioned previously can play a role, but through knowledge of its context, it can be put into perspective.
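The reduction to a trusted environment can be sketched as a simple filter over search results; the trust list, the records and their field names are, of course, invented for illustration:

```python
# Hypothetical trust list: e.g. known colleagues or followed authors,
# or sources recommended by a trusted person.
trusted_sources = {"alice", "bob"}

# Hypothetical search results, each carrying its creator as context.
results = [
    {"doc": "report.pdf", "source": "alice"},
    {"doc": "spam.html", "source": "unknown-seo-farm"},
    {"doc": "notes.txt", "source": "bob"},
]

# Reducing the data taken into account to the trusted environment:
# precision rises, recall shrinks.
trusted_results = [r for r in results if r["source"] in trusted_sources]
print([r["doc"] for r in trusted_results])
```

The descriptor «source» only becomes useful here because it is evaluated against context (the trust list); the same field, taken in isolation, would say nothing about reliability.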
In the following posts, I’d like to look at some examples of the use of descriptors which take context into account, in delimited areas as well as through the analysis of additional data. What I won’t be doing – though I did plan this initially – is to compare the effectiveness of full text search and search in metadata. I did find a few indications in library science literature – none of them conclusive – and I believe now that the comparison is not a legitimate one. Too much depends on what a user is looking for, which data he searches in, and how familiar he is with particular sources. So the question is simply whether descriptors can be powerful for findability, and under which circumstances.
Seth Godin did a nice gloss on the secondary importance of tools for (information) architecture: I need to build a house, what kind of hammer should I buy?