Archive for October, 2008

Recontextualization of metadata – part 2: databases

In a previous post, I asked under which circumstances descriptors can be powerful for findability. The classical example is databases – the attributes of the objects are nothing but descriptors, and highly standardized ones at that. The database itself delivers the context for the objects it contains: it is made expressly for certain data and for a certain purpose, thus providing clear and reliable context.

But if describing objects in a consistent way is such a drag, why would anyone bother to do it? Or to put it the other way round: Is this a way to make structuring information more intuitive, simple and pleasurable to use?

Let’s take a more detailed look at the first part of the question: why does anyone bother to describe objects in a database? Because objects in databases are similar to each other, and detailed descriptions make them comparable. In a database for flights, e.g. Kayak, you can compare all flights with a common origin and destination, e.g. from Amsterdam to Zurich, or you might want to find out how far 1000 Swiss francs could get you (though I haven’t yet seen this implemented). What drives providers of databases or vertical search engines to make the effort of describing objects is, on the one hand, the wish to offer the largest possible range of data and, on the other, to make the data as easily comparable as possible to meet consumers’ needs.

Flight Amsterdam - Zurich on kayak.com
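
To make the idea concrete, here is a minimal sketch of how standardized attributes make objects comparable. The table layout, sample data and prices are invented for illustration – Kayak’s actual data model is naturally far more complex.

import sqlite3

# A throwaway in-memory database with one standardized description per flight.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE flights (
        origin TEXT, destination TEXT, carrier TEXT,
        departure TEXT, price_chf REAL
    )
""")
con.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?, ?)",
    [
        ("AMS", "ZRH", "KL", "2008-10-20 07:15", 210.0),
        ("AMS", "ZRH", "LX", "2008-10-20 09:40", 185.0),
        ("AMS", "GVA", "LX", "2008-10-20 10:05", 240.0),
    ],
)

# Compare all flights with a common origin and destination ...
for row in con.execute(
    "SELECT carrier, departure, price_chf FROM flights "
    "WHERE origin = 'AMS' AND destination = 'ZRH' ORDER BY price_chf"
):
    print(row)

# ... or ask how far a 1000-franc budget could get you.
for row in con.execute(
    "SELECT destination, MIN(price_chf) FROM flights "
    "WHERE origin = 'AMS' GROUP BY destination HAVING MIN(price_chf) <= 1000"
):
    print(row)

Both queries work only because every object in the table was described with the same attributes in the same units – which is precisely the effort discussed above.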

Concerning the second part of the question, do databases make structuring information more intuitive, simple and pleasurable to use? The answer is no. Anyone who has ever been involved in aggregating structured data from different sources will tell you what back-breaking work it is. This even holds true for highly standardized environments such as international aviation. The relative homogeneity of the data described in a database may make the task of consistent description simpler, but the drawback is that databases are rigid. Conventions need to be agreed on to make consistent description possible, and they reduce the objects to the chosen characteristics.
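
A toy illustration of that aggregation work, with all source records, field names and conversion rules invented for the sketch: every source follows its own conventions, and each field has to be mapped by hand onto an agreed-upon target schema.

# Two sources describing the same kind of object in incompatible ways.
source_a = {"from": "Amsterdam", "to": "Zurich", "fare_eur": 140}
source_b = {"origin_iata": "AMS", "dest_iata": "ZRH", "price": "CHF 210.-"}

IATA = {"Amsterdam": "AMS", "Zurich": "ZRH"}  # one lookup table per convention
EUR_TO_CHF = 1.5  # rough 2008 rate, for illustration only

def normalize_a(rec):
    return {"origin": IATA[rec["from"]],
            "destination": IATA[rec["to"]],
            "price_chf": rec["fare_eur"] * EUR_TO_CHF}

def normalize_b(rec):
    # Strip the currency decoration to recover a bare number.
    return {"origin": rec["origin_iata"],
            "destination": rec["dest_iata"],
            "price_chf": float(rec["price"].strip("CHF .-"))}

flights = [normalize_a(source_a), normalize_b(source_b)]
print(flights)

Every new source means another set of mapping rules like these, and every change to the target schema means revisiting all of them – the rigidity described above.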

The rigidity of databases means they are not well adapted to change. This is very well argued by Hank Williams in his post The Death of the Relational Database (why an eloquent writer would choose the title Why does everything suck? for his blog is beyond a square like myself, but it’s worth the read):

As long as you don’t want to radically change or expand the scope of what you are doing, relational databases are great. But knowledge is an ever-expanding universe of objects and relationships between them. The relational database doesn’t handle that use case very well.

In my last post as well as in this one, I argued that the consistent use of descriptors is only possible within very narrow limits and that in order to take complexity – and change – into account, more flexible concepts are necessary. Is the Semantic Web an answer to that? As fascinating as the prospect sounds, is it really possible to detach relations from their context? The third and final part of this series shall look into this.


Recontextualization of metadata – part 1: keywords & classifications

In the last post, I pondered the question whether descriptors can be powerful for findability. I’ll avoid defining what findability is and discuss a few examples instead.

Example 1: Libraries

Descriptors are heavily used in libraries, but do they improve findability in practice? A search in two different university library catalogues (Basle and Zurich) for the interdisciplinary subject of the history of chemistry in Switzerland returned a mere two matches out of about 30 titles in each library. It is noteworthy that both libraries seem to hold the same books (I did a series of cross-checks) but deliver different search results for the same query.

In both cases, the search was conducted with the keywords «geschichte schweiz chemie» (history, Switzerland, chemistry). The screenshots of the keywords for an identical book (which the query retrieved in only one of the libraries) show why the results are so different:

Keywords for Tobias Straumann’s «Die Schöpfung im Reagenzglas» in Basle’s university library catalogue

Keywords for the same title in Zurich’s university library catalogue

The keywords were helpful for finding relevant works in both libraries, because the majority of the titles are useless for a subject search. Until recently, no Swiss library included any contextual information such as summaries or tables of contents in its search. However, the quality of the search results is dubious. The difficulties are very much the same as for search in unstructured data: the keywords are either too broad or too narrow, the controlled vocabularies are not flexible enough to account for all possible variations, and the human factor leads to inconsistent use of descriptors. Furthermore, it is impossible to reconstruct the reason for the divergence of the results.
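
A minimal model of that divergence, with invented keyword sets standing in for the actual catalogue records: the same book carries different controlled-vocabulary terms in each library, so an identical query matches in one catalogue and not in the other.

# Each catalogue maps a title to its set of assigned keywords.
catalogue_basle = {
    "Die Schöpfung im Reagenzglas": {"geschichte", "schweiz", "chemie"},
}
catalogue_zurich = {
    "Die Schöpfung im Reagenzglas": {"chemische industrie", "basel"},
}

def subject_search(catalogue, query_terms):
    # A title matches only if all query terms appear among its keywords.
    return [title for title, keywords in catalogue.items()
            if query_terms <= keywords]

query = {"geschichte", "schweiz", "chemie"}
print(subject_search(catalogue_basle, query))   # ['Die Schöpfung im Reagenzglas']
print(subject_search(catalogue_zurich, query))  # []

Nothing in the search results reveals that the second catalogue simply described the book with different terms – which is why the divergence cannot be reconstructed from the outside.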

Example 2: Classifications of Swiss Legislation

To make periodically published and revised legislation accessible by subject, it is filed in a classification. Even though in Switzerland the federal, cantonal (state) and communal jurisdictions are sovereign, the classifications of legislation are similar in large parts of the country and across all levels of state. The classifications were developed long before the Web and have been adapted since. Primary access is by browsing alphabetically or by subject.

Alphabetical index of the Compilation of Federal Legislation

The alphabetical index is a highly elaborate system of cross-references. It helps with disambiguation and points to results in less obvious categories.
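
Such an index can be thought of as a lookup table with «see» and «see also» pointers. A minimal sketch, with invented entries and an invented classification number:

# An alphabetical index: entries either carry a classification number
# directly or redirect the reader to another entry.
index = {
    "Arzneimittel": {"class": "812.21"},
    "Medikamente": {"see": "Arzneimittel"},        # synonym, disambiguated
    "Heilmittel": {"see_also": ["Arzneimittel"]},  # points to a related category
}

def resolve(term):
    entry = index.get(term, {})
    while "see" in entry:          # follow «see» pointers until a real entry
        term = entry["see"]
        entry = index[term]
    return term, entry

print(resolve("Medikamente"))  # ('Arzneimittel', {'class': '812.21'})

Every such pointer has to be created and maintained by hand – elaborate indeed, but also time-consuming and error-prone, as the conclusions below will note.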

Classification of the Swiss Confederation (excerpt)

Classification of the canton of Basel-Stadt (corresponding excerpt)

Access by subject gives the user an idea of the scope of the collection. The hierarchical approach allows him or her to quickly drill down to the relevant subject.
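
A toy classification tree with hierarchical drill-down, loosely modelled on the systematic classification of Swiss legislation – the headings and numbering here are schematic, not the official scheme:

# Nested headings form the hierarchy; leaves list the individual acts.
classification = {
    "8 Gesundheit - Arbeit - Soziale Sicherheit": {
        "81 Gesundheit": {
            "812 Heilmittel": ["812.21 Heilmittelgesetz"],
        },
        "82 Arbeit": {},
    },
}

def drill_down(tree, *path):
    node = tree
    for step in path:
        node = node[step]  # descend one level per chosen heading
    return node

print(drill_down(classification,
                 "8 Gesundheit - Arbeit - Soziale Sicherheit",
                 "81 Gesundheit",
                 "812 Heilmittel"))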

To find corresponding legislation across different levels of jurisdiction, a human user can easily spot the resemblances between similar classifications. For automatic aggregation, however, the differences between them are an obstacle. Therefore, the Swiss organization for e-government standards, eCH, has published classifications for e-government subjects (in German).
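
This is what automatic aggregation comes down to: a hand-maintained mapping from each jurisdiction’s local headings onto a shared vocabulary. The headings and the mapping below are invented, merely standing in for the eCH subject classification.

# Each canton labels the same subject differently.
local_headings = {
    "Kanton A": "Gesundheitswesen",
    "Kanton B": "Öffentliche Gesundheit",
}

# Someone has to maintain this table; a human spots the resemblance
# at a glance, a program only succeeds where the mapping exists.
to_standard = {
    "Gesundheitswesen": "Gesundheit",
    "Öffentliche Gesundheit": "Gesundheit",
}

for canton, heading in local_headings.items():
    print(canton, "→", to_standard.get(heading, "unmapped"))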

Rather surprisingly, the standard is not used consistently in the cantons cited as best practice examples (e.g. compare the cantons of Aargau and Basel-Stadt), and the synonyms and related terms recommended for improving findability are not used.

eCH standard

Version of Canton Aargau

Conclusions

Keywords improve findability when there is no other information available, e.g. in library catalogues without digitized content. However, keywords are only reliable when the mechanism of attributing them is transparent. «Intellectual control» – the archivists’ term for creating tools to access their records – is important not only for those who create the tools, but just as much for those who use them to gain access. Lack of transparency means loss of control, and even when laid open, systems of cross-reference can be confusing for the user, time-consuming to maintain and error-prone.

The second example, the classification of legislation, shows that transparency can at least partly compensate for a lack of control. This should not be an excuse for ignoring existing standards, which are primarily created for the sake of interoperability and to facilitate maintenance. But the examples do show that beyond a certain level of detail, control gets out of hand. The objects described often have too many facets to be consistently described – or to put it the other way round, it is difficult to create comprehensive classification systems. Thus, at a deeper level, classifications can easily shift from an asset to a risk, because results tend to become unreliable. In the digital world, I believe, it makes more sense to abandon descriptors at that level and to retreat to methods of natural language processing.
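
What that retreat might look like, in a deliberately naive sketch: the full text (summaries, tables of contents) is indexed directly and ranked by term overlap, with no descriptors assigned at all. The document and its summary are invented.

import re
from collections import Counter

# Full text stands in for the descriptors a cataloguer would otherwise assign.
docs = {
    "Die Schöpfung im Reagenzglas":
        "Geschichte der Basler Chemie und der chemischen Industrie der Schweiz",
}

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def score(query, text):
    # Naive term-frequency score: how often the query terms occur in the text.
    terms = Counter(tokenize(text))
    return sum(terms[t] for t in tokenize(query))

for title, text in docs.items():
    print(title, score("geschichte schweiz chemie", text))

No controlled vocabulary, no cross-references to maintain – and the same query that diverged between the two catalogues above finds the book as soon as its content is searchable.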