Archive for the 'Taxonomy' Category

Recontextualization of metadata – part 2: databases

In a previous post, I asked under which circumstances descriptors can be powerful for findability. The classic example is databases: the attributes of the objects are nothing but descriptors, and highly standardized ones at that. The database itself delivers the context for the objects it contains: it is made expressly for certain data and for a certain purpose, thus providing clear and reliable context.

But if describing objects in a consistent way is such a drag, why would anyone bother to do it? Or to put it the other way round: Is this a way to make structuring information more intuitive, simple and pleasurable to use?

Let’s take a more detailed look at the first part of the question: why does anyone bother to describe objects in a database? Because objects in databases are similar to each other, and detailed descriptions make them comparable. In a flight database such as Kayak, you can compare all flights with a common origin and destination, e.g. from Amsterdam to Zurich, or you might want to find out how far 1000 Swiss francs could get you (though I haven’t yet seen this implemented). The incentive for providers of databases or vertical search engines to make the effort of describing objects is, on the one hand, to offer the widest range of data and, on the other, to make the data as easily comparable as possible to meet consumers’ needs.
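A minimal sketch in Python (using the built-in sqlite3 module) of what such a flight database looks like conceptually. The schema, airlines and prices are invented for illustration; Kayak’s actual data model is certainly far richer:

```python
import sqlite3

# Hypothetical flights table: every attribute is a descriptor,
# and the schema itself supplies the context.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE flights (
    origin TEXT, destination TEXT, airline TEXT, price_chf REAL)""")
con.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", [
    ("AMS", "ZRH", "Swiss", 180.0),
    ("AMS", "ZRH", "KLM", 165.0),
    ("AMS", "GVA", "easyJet", 95.0),
])

# Compare all flights from Amsterdam to Zurich, cheapest first.
for row in con.execute(
        "SELECT airline, price_chf FROM flights "
        "WHERE origin = 'AMS' AND destination = 'ZRH' "
        "ORDER BY price_chf"):
    print(row)

# "How far could 1000 Swiss francs get me?" becomes trivial once
# the price attribute is consistently described.
for row in con.execute(
        "SELECT DISTINCT destination FROM flights "
        "WHERE origin = 'AMS' AND price_chf <= 1000"):
    print(row)
```

Both queries only work because every record describes its object with the same standardized attributes, which is exactly the consistency at issue here.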

Flight Amsterdam - Zurich on kayak.com

Concerning the second part of the question: do databases make structuring information more intuitive, simple and pleasurable to use? The answer is no. Anyone who has ever been involved in aggregating structured data from different sources will tell you what back-breaking work it is. This holds true even for highly standardized environments such as international aviation. The relative homogeneity of the data described in a database may make the task of consistent description simpler, but the drawback is that databases are rigid. Conventions need to be agreed on to make consistent description possible, and they reduce the objects to the chosen characteristics.

The rigidity of databases means they are not well adapted to change. This is very well argued by Hank Williams in his post The Death of the Relational Database (why an eloquent writer would choose the title Why does everything suck? for his blog is beyond a square like myself, but it’s worth the read):

As long as you don’t want to radically change or expand the scope of what you are doing, relational databases are great. But knowledge is an ever-expanding universe of objects and relationships between them. The relational database doesn’t handle that use case very well.
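A hypothetical illustration of that rigidity: in a relational schema, every new characteristic requires a migration of the structure itself, not just new data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flights (origin TEXT, destination TEXT, price_chf REAL)")
con.execute("INSERT INTO flights VALUES ('AMS', 'ZRH', 165.0)")

# Expanding the scope, say to record CO2 emissions, means migrating the
# schema; data entered before the change simply has no value for it.
con.execute("ALTER TABLE flights ADD COLUMN co2_kg REAL")
print(con.execute("SELECT * FROM flights").fetchone())
# ('AMS', 'ZRH', 165.0, None)

# New kinds of relationships (e.g. "this flight competes with that train")
# need new tables and new join logic: structure, not just data.
```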

In my last post as well as in this one, I argued that the consistent use of descriptors is only possible within very narrow limits and that, in order to take complexity – and change – into account, more flexible concepts are necessary. Is the Semantic Web an answer to that? As fascinating as the prospect sounds, is it really possible to detach relations from their context? The third and final part of this series will look into this.

Recontextualization of metadata – part 1: keywords & classifications

In the last post, I pondered the question of whether descriptors can be powerful for findability. I’ll avoid defining what findability is and discuss a few examples instead.

Example 1: Libraries

Descriptors are heavily used in libraries, but do they improve findability in practice? A search in two different university library catalogues (Basle and Zurich) for the interdisciplinary subject of the history of chemistry in Switzerland returned only two matches out of the roughly 30 relevant titles each library holds. It is noteworthy that both libraries seem to possess the same books (I did a series of cross-checks) but deliver different search results for the same query.

In both cases, the search was conducted with the keywords «geschichte schweiz chemie» (history, Switzerland, chemistry). The screenshots of the keywords for an identical book (which I found with these keywords in only one of the libraries) show why the results are so different:

Keywords for Tobias Straumann’s «Die Schöpfung im Reagenzglas» in Basle’s university library catalogue

Keywords for the same title in Zurich’s university library catalogue

The keywords were helpful for finding relevant works in both libraries, because the majority of the titles are useless for a subject search. Until recently, no Swiss library included contextual information such as summaries or tables of contents in its search. However, the quality of the search results is dubious. The difficulties are much the same as for search in unstructured data: the keywords are either too broad or too narrow, the controlled vocabularies are not flexible enough to account for all possible variations, and the human factor leads to inconsistent use of descriptors. Furthermore, it is impossible to reconstruct why the results diverge.
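The effect can be simulated in a few lines of Python. The keyword sets below are invented stand-ins for the two catalogue entries (the actual vocabularies differ); the point is merely that the same query succeeds against one set of descriptors and fails against the other:

```python
# Invented keyword sets for the same book in two catalogues.
basle_keywords = {"geschichte", "schweiz", "chemie", "industrie"}
zurich_keywords = {"chemische industrie", "wirtschaftsgeschichte", "basel"}

query = {"geschichte", "schweiz", "chemie"}

def matches(query, keywords):
    # Every query term must occur in some descriptor (substring match,
    # as a crude stand-in for catalogue matching).
    return all(any(term in kw for kw in keywords) for term in query)

print(matches(query, basle_keywords))   # True  -> the book is found
print(matches(query, zurich_keywords))  # False -> the same book is invisible
```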

Example 2: Classifications of Swiss Legislation

To make periodically published and revised legislation accessible by subject, it is filed in a classification. Even though the federal, cantonal (state) and communal jurisdictions in Switzerland are sovereign, the classifications of legislation are similar in large parts of the country and across all levels of government. The classifications were developed long before the Web and have been adapted since. Primary access is by browsing alphabetically or by subject.

Alphabetical index of the Compilation of Federal Legislation

The alphabetical index is a highly elaborate system of cross-references. It helps with disambiguation and points to results in less obvious categories.

Classification of the Swiss Confederation (excerpt)

Classification of the canton of Basel-Stadt (corresponding excerpt)

Access by subject gives the user an idea of the scope of the collection. The hierarchical approach allows him or her to drill down quickly to the relevant subject.
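Conceptually, such a classification is simply a tree, and browsing by subject is a walk from the root towards a leaf. A sketch, with labels that are loosely modelled on the federal classification but purely illustrative:

```python
# A toy classification tree; the labels are illustrative, not the actual
# Swiss systematic compilation.
classification = {
    "1 State, People, Authorities": {},
    "8 Health, Employment, Social Security": {
        "81 Health": {"812 Therapeutic Products": {}},
        "82 Employment": {},
    },
}

def drill_down(tree, path):
    # Follow a list of labels from the root down to the relevant subject.
    node = tree
    for label in path:
        node = node[label]
    return node

print(drill_down(classification,
                 ["8 Health, Employment, Social Security", "81 Health"]))
# {'812 Therapeutic Products': {}}
```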

To find the corresponding legislation across different levels of jurisdiction, a human user can easily spot the resemblances between similar classifications. For automatic aggregation, however, these merely similar structures are an obstacle. The Swiss organization for e-government standards, eCH, has therefore published classifications for e-government subjects (in German).

Rather surprisingly, the standard is not applied consistently even in the cantons cited as best-practice examples (compare, e.g., the cantons of Aargau and Basel-Stadt), and the synonyms and related terms recommended for improving findability are not being used.

eCH standard

Version of Canton Aargau
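Those synonyms and related terms are not a cosmetic detail. A minimal sketch of how such a synonym ring improves findability; the German terms below are invented examples, not taken from the eCH standard:

```python
# Hypothetical synonym ring: queries are expanded before matching, so a
# citizen who searches "Müllabfuhr" also finds pages filed under "Abfall".
synonyms = {
    "abfall": {"abfall", "müll", "müllabfuhr", "entsorgung", "kehricht"},
}

def expand(term):
    # Return the whole ring a term belongs to, or just the term itself.
    for ring in synonyms.values():
        if term in ring:
            return ring
    return {term}

print(expand("müllabfuhr"))  # the whole ring, not just the literal term
```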

Conclusions

Keywords improve findability when no other information is available, e.g. in library catalogues without digitized content. However, keywords are only reliable when the mechanism for attributing them is transparent. «Intellectual control» – the term archivists use for creating the tools that provide access to their records – is important not only for those who create the tools, but just as much for those who use them to gain access. Lack of transparency means loss of control, and even when laid open, systems of cross-reference can be confusing for the user, time-consuming to maintain and error-prone.

The other example, the classification of legislation, shows that transparency can at least partly compensate for lack of control. This should not be an excuse for ignoring existing standards, which are created primarily for the sake of interoperability and ease of maintenance. But the examples do show that beyond a certain level of detail, control gets out of hand. The objects described often have too many facets to be described consistently – or, to put it the other way round, it is difficult to create comprehensive classification systems. Thus, at a deeper level, classifications can easily shift from an asset to a liability, because results tend to become unreliable. In the digital world, I believe, it makes more sense to abandon descriptors at that level and to fall back on methods of natural language processing.
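As a rough sketch of what falling back on natural language processing can mean in its simplest form, here is a tiny TF-IDF ranker in Python: instead of relying on manually attributed descriptors, it ranks documents by the words they actually contain. Documents and query are invented:

```python
import math
import re
from collections import Counter

# Three invented document titles standing in for full texts.
docs = {
    "doc1": "Geschichte der chemischen Industrie in der Schweiz",
    "doc2": "Handbuch der organischen Chemie",
    "doc3": "Schweizer Wirtschaftsgeschichte im 20. Jahrhundert",
}

def tokens(text):
    return re.findall(r"\w+", text.lower())

# Document frequency: in how many documents does each word occur?
df = Counter()
for text in docs.values():
    df.update(set(tokens(text)))

def score(query, text):
    # Sum of term frequency * inverse document frequency per query term.
    tf = Counter(tokens(text))
    return sum(tf[t] * math.log(len(docs) / df[t])
               for t in tokens(query) if t in tf)

query = "geschichte chemie schweiz"
for name, text in sorted(docs.items(), key=lambda kv: -score(query, kv[1])):
    print(name, round(score(query, text), 2))
```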


Metadata for Findability

Does it make sense to add metadata to information for the sake of findability? Historically, it definitely did. Before the age of digitally available text and full-text search, there was no other means of providing access to large amounts of information. Librarians filed books under author, title, keywords etc. Today, descriptors – keywords, subject headings, tags etc. – seem to have difficulty making the transition to the 21st century. Is this justified? Let’s take a look at some examples I noted recently that doubt the effectiveness of descriptors.

Kumarkamal, in Basics of Search Engine Optimization, warns: «Don’t waste your time on meta tags», pointing out that search engines largely ignore descriptors.

Mathew Ingram raises the question «Who bookmarks anymore?» He states that, instead of consulting his bookmarks, «[i]f I’m writing about something and I remember some details, I type them into Google and eventually track the page down.» In a review of the social bookmarking service Qitera, the tech blog netzwertig even claims that «[Schlagwörter] sind nämlich im Prinzip nicht mehr als ein Ausgleich für den Mangel an intelligenten Suchmöglichkeiten», i.e. in principle, keywords are nothing but a compensation for the lack of possibilities for intelligent search.

What are the points at issue? Search engines ignore descriptors because of bias. Descriptors added by humans are biased – almost necessarily so: to give the gist of a text, you need to focus on some parts and omit others. In addition to this «immanent» bias, descriptors are often selected to influence search engine ranking, which leads to less reliable search results. Google first became famous because it developed algorithms to counter exactly this problem: Google’s ranking was based on reliability. Links to a page were interpreted as an indication that a site provided valuable and trustworthy information. Google thus introduced the context of a site into its search algorithm. But descriptors, being metadata, i.e. data «outside» the data, lack context – or rather, their context is not taken into account by search engines.
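The core idea of link-based ranking fits in a few lines. The following toy power iteration is in the spirit of PageRank as originally published, not Google’s actual production algorithm, and the link graph is invented:

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):  # iterate until the scores settle
    new = {p: (1 - damping) / len(pages) for p in pages}
    for p, outgoing in links.items():
        # A page passes its own score on to the pages it links to.
        share = damping * rank[p] / len(outgoing)
        for q in outgoing:
            new[q] += share
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

A page's score is thus derived from its context, the pages pointing at it, rather than from any descriptor it attributes to itself.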

The second example, on the other hand, is one of highly contextual search. Ingram remembers the item he is looking for («I remember some details»), can formulate a precise query and can easily decide which document is the one he was after. With all this information given, typing keywords into Google is a highly efficient search strategy bound to retrieve excellent results – so why bother to reduce the document to keywords beforehand? In addition, searching by descriptors would most probably return too many documents in this case. The strength of descriptors is recall, not precision. Again, this particularly holds true for descriptors lacking context. Of course, this doesn’t apply to all descriptors, but human language tends to be ambiguous without context.
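Recall and precision can be made concrete with toy numbers (entirely invented here): recall is the share of all relevant documents that are found, precision the share of found documents that are relevant.

```python
relevant = {"d1", "d2", "d3"}  # what the searcher actually wants

# A broad descriptor search finds everything relevant, plus much noise;
# a precise contextual query finds little, but mostly the right things.
keyword_results = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
precise_results = {"d1", "d9"}

def precision(results):
    return len(results & relevant) / len(results)

def recall(results):
    return len(results & relevant) / len(relevant)

print(precision(keyword_results), recall(keyword_results))  # 0.375 1.0
print(precision(precise_results), recall(precise_results))  # 0.5   0.33...
```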

In order to deliver useful results for search, descriptors need to be re-contextualized. Basically, there are two ways of including context: adding information, or reducing the information to a well-defined scope. The first is what machines do well: search engines can analyze vast amounts of information, find patterns, match items etc. Human beings usually prefer the second method: they reduce complexity by resorting to subjects with (more or less) clear outlines, and particularly to reliable environments. If the creator of a record is known and trusted, or if there is sufficient evidence that this is the case (e.g. the recommendation of a trusted person), then a record’s content can be taken to be trustworthy. Reducing the data taken into account for a search to a trusted environment greatly improves precision, and the price of diminished recall is negligible in this case. Of course, the bias mentioned previously can still play a role, but through knowledge of its context, it can be put into perspective.
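A minimal sketch of that reduction, with invented records and an invented notion of trust: scoping the search to a trusted environment trades a little recall for a lot of precision.

```python
relevant = {"r1", "r3", "r4"}           # what the searcher actually wants
trusted_sources = {"colleague", "quality-blog"}

records = {                             # record -> (source, matches query?)
    "r1": ("colleague", True),
    "r2": ("spam-farm", True),          # a biased record gaming the descriptors
    "r3": ("quality-blog", True),
    "r4": ("unknown", True),            # relevant, but outside the trusted scope
}

everything = {r for r, (src, hit) in records.items() if hit}
scoped = {r for r, (src, hit) in records.items() if hit and src in trusted_sources}

def precision(found): return len(found & relevant) / len(found)
def recall(found): return len(found & relevant) / len(relevant)

print(precision(everything), recall(everything))  # 0.75 1.0
print(precision(scoped), recall(scoped))          # 1.0  0.66... -- small price
```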

In the following posts, I’d like to look at some examples of the use of descriptors that take context into account, in delimited areas as well as through the analysis of additional data. What I won’t be doing – though I did plan this initially – is comparing the effectiveness of full-text search and search in metadata. I did find a few indications in library science literature – none of them conclusive – and I now believe that the comparison is not a legitimate one. Too much depends on what a user is looking for, which data he searches in, and how familiar he is with particular sources. So the question is simply whether descriptors can be powerful for findability, and under which circumstances.

One of these things is not like the other

Recently, an ad campaign reminded me of the books that teach young children to cluster concepts. The kids learn to find common characteristics of a number of objects, and to eliminate the object that is different.

Riddle

Creating categories by naming a difference from the rest is popular, but in information architecture, it’s often not acceptable. All objects need to be located SOMEplace in the structure. They need to “live” somewhere, as a customer recently phrased it (very nicely adopting the information architecture metaphor). In archives, the most important files usually end up classified under “general” or “miscellaneous” because no one bothered to create an adequate category for something happening outside the daily routine. In web projects, naming the residual categories (or breaking them down into well-defined clusters) is usually a time-consuming challenge and for the most part not well solved.

This is where tags come in. They reverse the process:

  1. Instead of creating a category based on shared features and giving that category an understandable name, the object itself is described by its distinct features.
  2. Tags are attributed to the objects by the users, not by information architects.

In the example of the riddle shown above, the kangaroo is usually picked as the odd one out because it’s a marsupial. Finding this distinguishing feature is actually easier than first finding the common category of the other animals and then working out whether the kangaroo falls inside or outside it.

Also, as a discussion of the riddle shows, the solution is not entirely unambiguous: the deer and the kangaroo are usually found in the wild, the kangaroo lives only in Australia (or in zoos), etc. Tags are convenient for describing such attributes without having to create an appropriate category. They can also be used to attach personal features, such as the fact that kangaroos remind me of my stay in Canberra in the year 2000, and thus enhance findability, e.g. for a photo of a kangaroo I took at that time.
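As a data structure, this is straightforward. A sketch with invented filenames and tags: each object carries whatever features matter, personal ones included, and an inverted index makes lookups and intersections cheap.

```python
from collections import defaultdict

# Tagging sketch: no category scheme has to be designed up front;
# each object simply carries its distinct and personal features.
tags = {
    "photo_042.jpg": {"kangaroo", "marsupial", "wild", "canberra", "2000"},
    "photo_117.jpg": {"deer", "wild", "forest"},
}

# Build an inverted index: tag -> objects carrying it.
index = defaultdict(set)
for obj, ts in tags.items():
    for t in ts:
        index[t].add(obj)

# "That kangaroo photo from my Canberra stay" becomes an intersection.
print(index["kangaroo"] & index["canberra"])  # {'photo_042.jpg'}
```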

I’m saving more about tags, e.g. tag clouds, creating categories from tags etc., for future posts.