About Erik J. Groeneveld

I love beautiful software.

What to do with Linked Data?

The web is moving towards linked data. Many data collections are available as Linked Data, including Dutch scientific libraries, museums and archives. What can we do with all this data? What tools do we need? The good news is that Linked Data can be adopted incrementally.

What problem does Linked Data address?

Objects in libraries, museums and archives are increasingly described by experts not related to these institutions. The resulting descriptions relate to persons, places and other concepts over which none of the experts or institutions can claim authority. Institutions are no longer the only authority on specific data collections, and they are certainly not authoritative on all the concepts their collections relate to. Maintaining collections of authoritative information is becoming increasingly difficult. Life cycle management of metadata records (possibly even maintaining different versions for different communities) is becoming a major challenge. Failing to maintain a clearly authoritative, and not isolated, collection undermines the existence of museums, archives and libraries, and of any other (broker) service that aims to add value.

How does Linked Data solve the problem?

Linked data allows everyone to make statements about everything [RDF Concepts]. It does not encapsulate knowledge about objects in a record, but represents the knowledge as a set of statements about the object. The record becomes a set of statements. This is simple, but fundamental. Consider the following record about a certain piece of art:

Record 5832:
    identifier = http://institute.org/987639
    title = "A true work of Art"
    creator = "V. van Gogh"

This could be represented with (at least) two statements:

http://institute.org/5832 has a title whose value is “A true work of Art”

http://institute.org/5832 has a creator whose value is “V. van Gogh”

The fundamental change here is that record 5832 no longer plays a role when exchanging data. The record was artificially created to describe an object, but the record itself is not important; only the statements it introduces are. (Linked Data can transparently introduce intermediate objects to group statements; however, these are not manageable items ‘to worry about’ as records were.) Only these sets of statements are exchanged. Maintaining an authoritative collection comes down to carefully selecting sets of statements to join.
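As a minimal sketch, the two statements above can be written as plain Python tuples, and joining collections then becomes a set union. The predicate URIs are an assumption (Dublin Core is used here, but the record above does not name a vocabulary), and the statement in `other_source` is hypothetical.

```python
# Predicate URIs are assumed for illustration (Dublin Core elements).
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"

def record_to_statements(subject, fields, predicates):
    """Turn a flat record into a set of (subject, predicate, object) statements."""
    return {(subject, predicates[name], value) for name, value in fields.items()}

statements = record_to_statements(
    "http://institute.org/5832",
    {"title": "A true work of Art", "creator": "V. van Gogh"},
    {"title": DC_TITLE, "creator": DC_CREATOR},
)

# A hypothetical statement from another source about the same object.
other_source = {("http://institute.org/5832", DC_CREATOR, "Vincent van Gogh")}

# Joining is a plain set union; no record boundary remains.
joined = statements | other_source
```

Because the statements are just members of a set, selecting which sources to join is the only decision left; there is no record to merge or version.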

Resolving Statements

Formally, a statement is a triple of (Subject, Predicate, Object) [RDF Concepts]. Together, triples form graphs.

Subject and Predicate are always URIs, while Object can be a URI or a value. In the example above the Creator statement could have been:

http://institute.org/5832 has a creator whose URI is info:eu-repo/dai/nl/071792279

For the actual name of this person, we will have to look for statements saying something about info:eu-repo/dai/nl/071792279 as the subject. This again might resolve to a URI so we have to repeat the process until we find a value.
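The repeated lookup can be sketched as a small loop over a list of triples. This is only an illustration under assumptions: the predicates (`owl:sameAs`, `foaf:name`), the intermediate URI `http://example.org/person/42` and the crude `is_uri` check are all hypothetical choices, not part of the original example.

```python
# Assumed predicates and data, for illustration only.
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

triples = [
    ("http://institute.org/5832", DC_CREATOR, "info:eu-repo/dai/nl/071792279"),
    ("info:eu-repo/dai/nl/071792279", OWL_SAME_AS, "http://example.org/person/42"),
    ("http://example.org/person/42", FOAF_NAME, "V. van Gogh"),
]

def is_uri(term):
    # Crude check, adequate for this sketch only.
    return term.startswith(("http://", "https://", "info:"))

def resolve(term, triples, follow_predicates, max_hops=10):
    """Follow statements about `term` until a plain value is found."""
    seen = set()
    while is_uri(term):
        if term in seen or len(seen) >= max_hops:
            return None  # cycle or chain too long
        seen.add(term)
        candidates = [o for (s, p, o) in triples
                      if s == term and p in follow_predicates]
        if not candidates:
            return None  # unresolved: nothing known about this URI
        term = candidates[0]
    return term

name = resolve("info:eu-repo/dai/nl/071792279", triples, {OWL_SAME_AS, FOAF_NAME})
```

A result of `None` marks a statement that could not be resolved with the triples at hand.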

How can institutions take advantage of Linked Data?

If the world around an institution is a cloud of Linked Data sources, the center of the cloud is where the institution has most of its authority. Surrounding this authority center are the related data sources on which the institution has less authority. Together we call this the authority cloud.

With this as a reference, take the following small steps:

  1. Start seeing data collections as statements, both in your Authority Cloud and outside it. Do not worry if they are not in RDF; that is not required.
  2. Start with using global persistent identifiers for all your objects. This allows you and others to make statements about the objects and to have meaningful joins.
  3. Start gathering triples from the sources within your Authority Cloud in a Triple Store. When sources are not in RDF just use simple tools to extract triples.
  4. Populate your local services using the Triple Store to resolve others' statements. For example, while indexing your own metadata, use the Triple Store to create additional search fields, facets, tag clouds etc.
  5. While displaying objects, turn unresolved statements into clickable links.
  6. For advanced users: start making use of the Triple Store’s query capabilities for enhancing your services.
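Steps 3 and 4 can be sketched with nothing more than the standard library. The example below assumes a hypothetical CSV export of author names, uses a plain list as a stand-in for the Triple Store, and invents the index field name `creator_name`; none of these come from a real system.

```python
import csv
import io

FOAF_NAME = "http://xmlns.com/foaf/0.1/name"  # assumed predicate

def triples_from_csv(text, subject_column, predicate_for):
    """Step 3: extract (subject, predicate, object) triples from a non-RDF source."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = row[subject_column]
        for column, predicate in predicate_for.items():
            if row.get(column):
                triples.append((subject, predicate, row[column]))
    return triples

def enrich(document, triples, field_for):
    """Step 4: add search fields by resolving the document's URI values."""
    enriched = dict(document)
    for value in document.values():
        for s, p, o in triples:
            if s == value and p in field_for:
                enriched.setdefault(field_for[p], []).append(o)
    return enriched

# Hypothetical source data within the Authority Cloud.
authors_csv = "uri,name\ninfo:eu-repo/dai/nl/071792279,V. van Gogh\n"
store = triples_from_csv(authors_csv, "uri", {"name": FOAF_NAME})

document = {
    "identifier": "http://institute.org/5832",
    "creator": "info:eu-repo/dai/nl/071792279",
}
enriched = enrich(document, store, {FOAF_NAME: "creator_name"})
```

The enriched document keeps the original URI and gains a searchable name, so existing indexing tools can stay in place.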

    What tools are needed to deal with Linked Data?

    Keep your tools! Unless you are dissatisfied, of course, retain your investment. You will, however, need a scalable Triple Store in your own data center. Since this Triple Store contains all the statements you decided need resolving before offering your service, it must be fast and readily available.

    In the next installment of this blog, we will outline how Meresco can be used to implement Linked Data.

    Wikiwijs, EduRep and Meresco

    The Wikiwijs website was launched December 14 by the Dutch minister of education as an open environment in which every teacher can find, use and adapt learning materials for any educational level. Wikiwijs search is powered by KennisNet’s EduRep platform which is based on Meresco.

    Wikiwijs Search

    The search part of Wikiwijs connects to EduRep to carry out the search queries entered by users of wikiwijs.nl. EduRep returns IEEE LOM records, which Wikiwijs in turn displays nicely formatted on the result page. Wikiwijs offers further refinement by presenting selectors for the desired educational level and target audience.

    Educational level

    Users interested in certain educational levels, say MBO, can select these in a box beside the search results and press the button to update the results. We searched for luchtkwaliteit and the system responded with 263 results. We then selected MBO, which narrowed the results down to 193 at MBO level.

    Use of faceted search

    If we select PO, SBaO and SO instead of MBO, the system responds with "no results found" and suggests removing filters so that “hopefully Wikiwijs will find more”. It would be preferable if Wikiwijs hid selectors that do not yield any results. The underlying search interface of EduRep provides this information, as can be seen in this query: SRU Query. It requests a facet on the field lom.educational.context.value, which tells us the number of matching records for each of the educational levels. We hope that a future version of Wikiwijs will take advantage of this feature.

    Technology

    Wikiwijs connects to EduRep using the SRU protocol, a standard from the Library of Congress. The SRU service is implemented with Meresco, which adds the extension parameter x-term-drilldown. This extension allows a client to request faceting on a given metadata field. For more information see Meresco Public Interfacs.odt in the Technical Documentation.
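A request of this kind could be assembled as follows. This is a sketch under assumptions: the base URL is a placeholder (not the real EduRep endpoint), SRU version 1.1 is assumed, and the value format accepted by x-term-drilldown may offer more options than the bare field name used here.

```python
from urllib.parse import urlencode

def sru_drilldown_url(base_url, query, facet_field, maximum_records=10):
    """Build an SRU searchRetrieve URL that also requests a facet on one field."""
    params = {
        "version": "1.1",               # assumed SRU version
        "operation": "searchRetrieve",
        "query": query,
        "maximumRecords": maximum_records,
        "x-term-drilldown": facet_field,  # Meresco extension parameter
    }
    return base_url + "?" + urlencode(params)

url = sru_drilldown_url(
    "http://sru.example.org/sru",  # placeholder endpoint
    "luchtkwaliteit",
    "lom.educational.context.value",
)
```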

    The response to the SRU query above would contain the following XML:

    <dd:drilldown xsi:schemaLocation="…">
      <dd:term-drilldown>
        <dd:navigator name="lom.educational.context.value">
          <dd:item count="193">BVE</dd:item>
          <dd:item count="4">VO</dd:item>
          <dd:item count="3">HBO</dd:item>
        </dd:navigator>
      </dd:term-drilldown>
    </dd:drilldown>
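A client could use this response to hide selectors without results. The sketch below parses a copy of the fragment above; the namespace URI bound to `dd` is a placeholder assumption (the fragment does not show the real one), and the list of selector codes passed to `visible_selectors` is illustrative.

```python
import xml.etree.ElementTree as ET

DD_NS = "http://example.org/drilldown"  # placeholder namespace, an assumption

SAMPLE = f"""<dd:drilldown xmlns:dd="{DD_NS}">
  <dd:term-drilldown>
    <dd:navigator name="lom.educational.context.value">
      <dd:item count="193">BVE</dd:item>
      <dd:item count="4">VO</dd:item>
      <dd:item count="3">HBO</dd:item>
    </dd:navigator>
  </dd:term-drilldown>
</dd:drilldown>"""

def facet_counts(xml_text, field):
    """Return {term: count} for the navigator of the given field."""
    root = ET.fromstring(xml_text)
    counts = {}
    for navigator in root.iter(f"{{{DD_NS}}}navigator"):
        if navigator.get("name") == field:
            for item in navigator.findall("dd:item", {"dd": DD_NS}):
                counts[item.text] = int(item.get("count"))
    return counts

def visible_selectors(all_levels, counts):
    """Keep only the selectors that would yield at least one result."""
    return [level for level in all_levels if counts.get(level, 0) > 0]

counts = facet_counts(SAMPLE, "lom.educational.context.value")
selectors = visible_selectors(["PO", "SO", "BVE", "VO", "HBO"], counts)
```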

    Autocomplete with 5 million+ proper names

    The Library of the Technical University Delft (TUDelft) added an autocomplete function to their Discover site using Meresco. It suggests search terms using the full corpus of all their databases. It also suggests more specific terms when users use fields in their query.

    Proper names

    TUDelft's databases contain many technical terms and other proper names, such as structured chemical names. Discover uses these to suggest search terms, rather than the history of users' queries. For example, when a user types tri, the autocomplete suggests chemical names such as tri-0-acetyl and triacyglycerol from over 5,000,000 available terms.

    Faceted Search integration

    The autocomplete is fully integrated with the facets in Discover. As a result, it is able to show exactly how many results a user can expect for each suggested term given the current selection of facets, and of course it does not suggest terms that yield no results.

    Suggestions for Fields

    As an option, users can use fields to limit the range for specific keywords. For example, the query author=johnson will search for johnson only in the field author. The search box automatically detects fields and applies the proper suggestions for that field. For example, a user searching for author=nahu gets 3 different spelling variants suggested, together with the number of results to expect.
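Field detection of this kind can be sketched as a simple split on the first equals sign. The function name and the set of known fields are hypothetical; the actual detection in Discover may work differently.

```python
def split_field_query(text, known_fields):
    """Split 'field=prefix' into (field, prefix); return (None, text) otherwise."""
    field, separator, prefix = text.partition("=")
    field = field.strip().lower()
    if separator and field in known_fields:
        return field, prefix.strip()
    return None, text.strip()

fielded = split_field_query("author=nahu", {"author", "title"})
plain = split_field_query("tri", {"author", "title"})
```

The returned field then selects which set of terms the suggestions are drawn from, while a result of `None` means the full corpus is used.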

    Implementation

    Discover’s autocomplete has been implemented with Meresco’s autocomplete and faceting. The autocomplete is capable of handling millions of terms under a high user load due to its implementation of a Burst Trie, which is integrated with the facet index. On the client side it uses jQuery and CSS to present the autocomplete search box.
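To illustrate the idea of prefix lookup combined with result counts, here is a minimal sketch using a plain trie. It is not the Burst Trie that Meresco uses, and the counts are hypothetical; in Discover they would come from the facet index for the current facet selection.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # results for the term ending here; 0 means nothing to suggest

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, term, count):
        node = self.root
        for char in term:
            node = node.children.setdefault(char, TrieNode())
        node.count = count

    def suggest(self, prefix, limit=10):
        """Return (term, count) pairs for terms starting with prefix, most results first."""
        node = self.root
        for char in prefix:
            node = node.children.get(char)
            if node is None:
                return []
        found = []
        stack = [(prefix, node)]
        while stack:
            term, current = stack.pop()
            if current.count > 0:  # never suggest terms that yield no results
                found.append((term, current.count))
            for char, child in current.children.items():
                stack.append((term + char, child))
        found.sort(key=lambda pair: (-pair[1], pair[0]))
        return found[:limit]

trie = Trie()
trie.add("tri-0-acetyl", 12)    # hypothetical counts
trie.add("triacyglycerol", 7)
trie.add("triangle", 0)         # present in the corpus, but no results right now
suggestions = trie.suggest("tri")
```

A production structure for millions of terms would need a more compact node representation, which is the kind of problem a Burst Trie addresses.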