Dependable OAI Repositories

With the rising popularity of Open Access, organizations expect their OAI repositories to be highly dependable. The repository must be able to deal with millions of records and respond quickly to frequent requests from Service Providers.

The Meresco community followed these developments by continuously improving Meresco’s OAI components. During this process, compliance with the OAI-PMH specification grew to nearly 100%, and new specialized indexes were added to keep query response times well under one second.

History

Back in 2007, the first OAI-PMH repository components were implemented in the LOREnet project. The 16 components were reduced to 8 in the OpenER project for the Open University. These 8 components still exist, but some of them have been significantly refactored to keep up with load and volume requirements. At the end of 2008, Berkeley DB replaced Lucene, which made the repository respond much faster to requests that carry from and until parameters. In 2009, the huge number of sets in the LOREnet project required an even more specialized index to maintain query response times.

Present situation

Today, several repositories holding millions of records are in use by, among others, Sound & Vision (Beeld en Geluid) and the University of Tilburg (UvT). These two are examples of stand-alone repository implementations. LOREnet and EduRep are examples of repositories integrated in, respectively, a portal and a search engine.

Indexes and Storage

Initially, creating a repository was straightforward using Meresco’s existing storage and Lucene index components. The new specialized indexes for OAI were also made available as reusable components. This extends the range of available indexes, which are now: Full text (Lucene), Facets, Range and Dictionary (Berkeley DB and BurstTrie).
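
To illustrate why a dedicated range index suits OAI-PMH requests with from and until parameters, here is a minimal sketch in Python. It is not Meresco's implementation: it keeps datestamps in a sorted in-memory list and uses the standard bisect module, and the class name RangeIndex and its methods are hypothetical.

```python
import bisect
from typing import List, Optional


class RangeIndex:
    """Hypothetical range index: datestamps kept sorted, identifiers alongside."""

    def __init__(self) -> None:
        self._stamps: List[str] = []  # sorted datestamps
        self._ids: List[str] = []     # identifiers, parallel to _stamps

    def add(self, datestamp: str, identifier: str) -> None:
        # UTC datestamps in ISO 8601 form sort chronologically as plain strings.
        pos = bisect.bisect_right(self._stamps, datestamp)
        self._stamps.insert(pos, datestamp)
        self._ids.insert(pos, identifier)

    def between(self, from_: Optional[str] = None,
                until: Optional[str] = None) -> List[str]:
        """Return identifiers with from_ <= datestamp <= until (both optional)."""
        lo = 0 if from_ is None else bisect.bisect_left(self._stamps, from_)
        hi = len(self._stamps) if until is None else bisect.bisect_right(self._stamps, until)
        return self._ids[lo:hi]


index = RangeIndex()
index.add("2009-03-01T00:00:00Z", "oai:c")
index.add("2008-12-31T23:59:59Z", "oai:a")
index.add("2009-01-15T12:00:00Z", "oai:b")
recent = index.between(from_="2009-01-01T00:00:00Z")
```

Both boundaries are found by binary search, so the cost of locating a range does not depend on how many records fall outside it.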

Repositories, Search Engines and Archives

Using the available index and storage components, a repository is created just as easily as a search engine or a complete archive. After all, these are quite similar things. Any repository needs storage, but also an index to maintain it. Similarly, every search engine needs an index, but also storage from which to obtain the result records. An archive is yet another combination of storage and index, but with different intentions.

Wikiwijs, EduRep and Meresco

The Wikiwijs website was launched on December 14 by the Dutch minister of education as an open environment in which every teacher can find, use and adapt learning materials for any educational level. Wikiwijs search is powered by KennisNet’s EduRep platform, which is based on Meresco.

Wikiwijs Search

The search part of Wikiwijs connects to EduRep to carry out the search queries entered by the users of wikiwijs.nl. EduRep returns IEEE LOM records, which Wikiwijs in turn displays, nicely formatted, on the result page. Wikiwijs offers further refinement options by presenting selectors for the desired educational level and target audience.

Educational level

Users interested in certain educational levels, say MBO, can select these in a box beside the search results and press the button to update the results. We searched for luchtkwaliteit and the system responded with 263 results. We then selected MBO, which narrowed the results down to 193 at MBO level.

Use of faceted search

If we select PO, SBaO and SO instead of MBO, the system responds with "no results found" and suggests removing filters so that “hopefully Wikiwijs will find more”. It would be preferable if Wikiwijs hid selectors that do not yield any results. The underlying search interface of EduRep provides this information, as can be seen in this query: SRU Query. It requests a facet on the field lom.educational.context.value, which tells us the number of matching records for each of the educational levels. We hope that a future version of Wikiwijs will take advantage of this feature.
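
As a rough illustration of how facet counts could be used to hide empty selectors, consider the following Python sketch. The function name visible_selectors is hypothetical, and the sketch assumes that the selector codes shown to the user are the same strings as the facet values returned by EduRep, which may not hold in practice.

```python
from typing import Dict, List


def visible_selectors(selectors: List[str], facet_counts: Dict[str, int]) -> List[str]:
    """Keep only the selectors for which the facet reports at least one record."""
    return [code for code in selectors if facet_counts.get(code, 0) > 0]


# Illustrative input: counts per educational context for one query.
counts = {"BVE": 193, "VO": 4, "HBO": 3}
shown = visible_selectors(["PO", "SBaO", "SO", "VO", "BVE", "HBO"], counts)
```

Selectors that are missing from the facet response are treated as having zero matching records and are therefore left out.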

Technology

Wikiwijs connects to EduRep using the SRU protocol, a standard maintained by the Library of Congress. The SRU service is implemented with Meresco, which adds the extension parameter x-term-drilldown. This extension allows a client to request faceting on a given metadata field. For more information, see Meresco Public Interfacs.odt in the Technical Documentation.
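
A client request of this kind could be assembled roughly as follows. This is a sketch under assumptions: the base URL is a placeholder, the SRU version number is a guess, and passing the bare field name as the value of x-term-drilldown is assumed rather than taken from the Meresco documentation. Only the standard SRU parameters version, operation and query, and the parameter name x-term-drilldown, come from the protocol and the text.

```python
from urllib.parse import urlencode


def build_sru_url(base_url: str, query: str, drilldown_field: str) -> str:
    """Build a searchRetrieve URL that also asks for a facet on one field."""
    params = {
        "version": "1.1",                     # assumed SRU version
        "operation": "searchRetrieve",        # standard SRU operation
        "query": query,                       # CQL query
        "x-term-drilldown": drilldown_field,  # value format is an assumption
    }
    return base_url + "?" + urlencode(params)


# http://example.org/sru is a placeholder, not the real EduRep endpoint.
url = build_sru_url("http://example.org/sru", "luchtkwaliteit",
                    "lom.educational.context.value")
```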

The response to the SRU query above would contain the following XML:

<dd:drilldown xsi:schemaLocation="…">
  <dd:term-drilldown>
    <dd:navigator name="lom.educational.context.value">
      <dd:item count="193">BVE</dd:item>
      <dd:item count="4">VO</dd:item>
      <dd:item count="3">HBO</dd:item>
    </dd:navigator>
  </dd:term-drilldown>
</dd:drilldown>
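
A client could turn such a fragment into a table of counts along these lines. The sketch uses Python's standard xml.etree.ElementTree. The namespace URI bound to the dd prefix is not shown in the text, so the value of DD_NS below is a placeholder that would have to be replaced by the real one; the function name facet_counts is likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Placeholder: the real namespace URI for the dd prefix is not given in the text.
DD_NS = "http://example.org/drilldown"

SAMPLE = f"""<dd:drilldown xmlns:dd="{DD_NS}">
  <dd:term-drilldown>
    <dd:navigator name="lom.educational.context.value">
      <dd:item count="193">BVE</dd:item>
      <dd:item count="4">VO</dd:item>
      <dd:item count="3">HBO</dd:item>
    </dd:navigator>
  </dd:term-drilldown>
</dd:drilldown>"""


def facet_counts(xml_text: str, field: str) -> Dict[str, int]:
    """Return {facet value: record count} for the navigator of the given field."""
    root = ET.fromstring(xml_text)
    counts: Dict[str, int] = {}
    for navigator in root.iter(f"{{{DD_NS}}}navigator"):
        if navigator.get("name") != field:
            continue
        for item in navigator.findall(f"{{{DD_NS}}}item"):
            counts[(item.text or "").strip()] = int(item.get("count", "0"))
    return counts


levels = facet_counts(SAMPLE, "lom.educational.context.value")
```

In a real SRU response this fragment would sit inside a larger envelope; because the function searches the whole tree for navigator elements, it would work on the enclosing document as well, provided the namespace matches.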

Autocomplete with 5 million+ proper names

The Library of the Technical University Delft (TUDelft) added an autocomplete function to their Discover site using Meresco. It suggests search terms drawn from the full corpus of all their databases. It also suggests more specific terms when users use fields in their query.

Proper names

TUDelft’s databases contain many technical terms and other proper names, such as structured chemical names. Discover uses these to suggest search terms, rather than the history of users’ queries. The example below shows a user typing tri and the autocomplete suggesting chemical names such as tri-0-acetyl and triacylglycerol from over 5,000,000 available terms.

Faceted Search integration

The autocomplete is fully integrated with the facets in Discover. As a result, it can show exactly how many results a user can expect for each suggested term given the current selection of facets, and it does not suggest terms that yield no results.
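
The idea of combining suggestions with the current facet selection can be sketched as follows. This is a simplification for illustration only, not the way Meresco does it: terms map to sets of document ids, the facet selection is represented as the set of documents that match it, and the function name suggest is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def suggest(prefix: str,
            term_docs: Dict[str, Set[int]],
            selected_docs: Set[int],
            limit: int = 10) -> List[Tuple[str, int]]:
    """Suggest terms starting with prefix, with hit counts within the selection."""
    suggestions: List[Tuple[str, int]] = []
    for term in sorted(term_docs):
        if not term.startswith(prefix):
            continue
        hits = len(term_docs[term] & selected_docs)
        if hits:  # terms without results in the current selection are skipped
            suggestions.append((term, hits))
    return suggestions[:limit]


# Illustrative data: which documents contain which term.
term_docs = {"triangle": {1, 2, 3}, "trie": {4}, "tree": {1}}
with_both = suggest("tri", term_docs, selected_docs={1, 2, 4})
narrowed = suggest("tri", term_docs, selected_docs={1, 2})
```

Narrowing the selection changes both the counts and the set of suggested terms, which is the behaviour described above.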

Suggestions for Fields

As an option, users can use fields to restrict specific keywords to a part of the record. For example, the query author=johnson will search for johnson only in the field author. The search box automatically detects fields and applies the proper suggestions for that field. The example below shows a user searching for author=nahu, with the search box suggesting 3 different spelling variants together with the number of results to expect.
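
Detecting a field in the text typed so far could look roughly like this. The sketch assumes that a field is simply whatever precedes the first "=" sign; the function name split_field_query is hypothetical and the actual Discover search box may apply different rules.

```python
from typing import Optional, Tuple


def split_field_query(text: str) -> Tuple[Optional[str], str]:
    """Split 'field=prefix' into (field, prefix); return (None, text) if no field."""
    field, separator, value = text.partition("=")
    if separator and field.strip():
        return field.strip(), value.strip()
    return None, text.strip()


field, prefix = split_field_query("author=nahu")
```

The returned field can then be used to pick the term list that suggestions are drawn from, with a general list as the fallback when no field is present.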

Implementation

Discover’s autocomplete has been implemented with Meresco’s autocomplete and faceting components. The autocomplete is capable of handling millions of terms under a high user load thanks to its implementation of a Burst Trie, which is integrated with the facet index. On the client side, it uses jQuery and CSS to present the autocomplete search box.
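
For readers unfamiliar with the data structure, the following Python sketch shows the basic idea of a burst trie: terms are kept in small containers, and a container that grows too large is "burst" into a trie node with one child per leading character. It is a simplified illustration rather than Meresco's code. The containers are plain dictionaries, the counts are fixed numbers instead of being computed from a facet index, and BURST_LIMIT and all names are chosen for this example.

```python
from itertools import islice
from typing import Dict, Iterator, List, Optional, Tuple

BURST_LIMIT = 4  # illustrative threshold; real implementations use far larger containers


class Node:
    """Either a small container of suffixes or, once burst, a trie node."""

    def __init__(self) -> None:
        self.container: Dict[str, int] = {}              # suffix -> count (before burst)
        self.children: Optional[Dict[str, "Node"]] = None  # char -> child (after burst)
        self.end_count = 0                               # count of a term ending at this node

    def insert(self, suffix: str, count: int) -> None:
        if self.children is None:
            self.container[suffix] = self.container.get(suffix, 0) + count
            if len(self.container) > BURST_LIMIT:
                self._burst()
        elif suffix == "":
            self.end_count += count
        else:
            self.children.setdefault(suffix[0], Node()).insert(suffix[1:], count)

    def _burst(self) -> None:
        # Turn the container into a trie node and redistribute its entries.
        entries, self.container, self.children = self.container, {}, {}
        for suffix, count in entries.items():
            self.insert(suffix, count)

    def walk(self, prefix: str, built: str = "") -> Iterator[Tuple[str, int]]:
        """Yield (term, count) in lexicographic order for terms matching prefix."""
        if self.children is None:
            for suffix, count in sorted(self.container.items()):
                if suffix.startswith(prefix):
                    yield built + suffix, count
        elif prefix == "":
            if self.end_count:
                yield built, self.end_count
            for char in sorted(self.children):
                yield from self.children[char].walk("", built + char)
        elif prefix[0] in self.children:
            yield from self.children[prefix[0]].walk(prefix[1:], built + prefix[0])


def suggest(root: Node, prefix: str, limit: int = 10) -> List[Tuple[str, int]]:
    """Return at most limit suggestions for the given prefix."""
    return list(islice(root.walk(prefix), limit))


root = Node()
for term, count in [("triacylglycerol", 12), ("tri-o-acetyl", 7),
                    ("trichloromethane", 3), ("triangle", 5),
                    ("trie", 2), ("toluene", 9)]:
    root.insert(term, count)
top = suggest(root, "tri", limit=3)
```

Because only crowded parts of the key space are expanded into trie nodes, lookups for a prefix touch a short path of nodes and then a small container, which keeps both memory use and lookup time modest as the number of terms grows.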