What makes Meresco different from Solr?

Solr focuses on getting the most out of one index type: Lucene. Meresco supports a number of different index types, each specialized for a specific task. Queries are split, each part is processed by the most appropriate index, and the results are integrated. This ensures that all types of queries are processed within tens or at most hundreds of milliseconds.
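To make the split-and-integrate idea concrete, here is a minimal sketch in Python. It is not Meresco code: the two toy indexes, their contents and the function names are assumptions made for illustration only. Each query part is answered by its own index, and the resulting identifier sets are intersected.

```python
from typing import Callable, Dict, List, Set

# Hypothetical stand-ins for two specialized indexes (illustration only).
FULLTEXT: Dict[str, Set[str]] = {
    "water": {"doc1", "doc2", "doc3"},
    "air": {"doc2", "doc4"},
}
DATES: Dict[str, int] = {
    "doc1": 20081130, "doc2": 20090615, "doc3": 20100301, "doc4": 20110101,
}


def fulltext_lookup(term: str) -> Set[str]:
    """Identifiers of documents containing the term."""
    return set(FULLTEXT.get(term, set()))


def range_lookup(low: int, high: int) -> Set[str]:
    """Identifiers whose date satisfies low < date <= high."""
    return {doc for doc, date in DATES.items() if low < date <= high}


def execute(parts: List[Callable[[], Set[str]]]) -> Set[str]:
    """Run every query part on its own index and intersect the results."""
    results = [part() for part in parts]
    return set.intersection(*results) if results else set()


matches = execute([
    lambda: fulltext_lookup("water"),
    lambda: range_lookup(20090101, 20101231),
])
print(sorted(matches))
```

Because every index returns only identifiers, integrating the partial results is a cheap set operation.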

Each type of index has distinct and unique properties such as specific query algorithms, optimized access patterns and scalability. We will introduce each index type below together with a short characterization.

Fulltext Index
This index is optimized for queries for combinations of words, literal phrases, words near other words, etc. It is implemented with Lucene, which is known to scale very well. Meresco helps it scale by keeping it small; see the post Storage versus Index.

Facet Index
The facet index specializes in drilldown (faceting) queries, dynamic clustering and tag clouds. It produces exact results even on large data sets, which is one of Meresco’s unique selling points. Meresco uses custom data structures and algorithms which scale to billions of postings on a single node.
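As an illustration of what a drilldown computes, the sketch below counts facet values over a result set. It is a naive dictionary-based version, not Meresco's custom data structures; the data and names are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical data: the facet values stored per document identifier.
FACET_VALUES: Dict[str, List[str]] = {
    "doc1": ["MBO"],
    "doc2": ["MBO", "VO"],
    "doc3": ["HBO"],
}


def drilldown(result_ids: Iterable[str]) -> Counter:
    """Count exactly how often each facet value occurs in a result set."""
    counts: Counter = Counter()
    for doc_id in result_ids:
        counts.update(FACET_VALUES.get(doc_id, []))
    return counts


counts = drilldown(["doc1", "doc2"])
print(counts.most_common())
```

The counts are exact because every matching document is visited; the engineering challenge in a real facet index is doing this fast for very large result sets.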

Dictionary Index
This index supports fast lookup of arbitrary textual information related to keys. It supports simple lookup ‘queries’ only. It is implemented using Berkeley DB, which is known for its good scalability and performance. It is being used to scale up set and metadataPrefix queries in OAI-PMH to tens of millions of records. This post describes the process: Dependable OAI Repositories.
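The lookup pattern itself is simple. The sketch below uses Python's standard `dbm.dumb` module purely as a stand-in for Berkeley DB (it is a different, much simpler key-value store); the key and value are hypothetical.

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "dictionary")

    # Write: attach arbitrary textual information to a key.
    with dbm.dumb.open(path, "c") as db:
        db[b"oai:example.org:1"] = b"set=physics"

    # Read: a plain key lookup, the only kind of 'query' supported.
    with dbm.dumb.open(path, "r") as db:
        value = db[b"oai:example.org:1"]

print(value)
```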

Sorted Dictionary Index
This index supports extremely fast lookup of simple numeric information attached to alphabetically ordered terms. It supports prefix queries such as needed for auto-complete. It is implemented using a Burst Trie.
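A prefix query over alphabetically ordered terms can be sketched as follows. Note that this uses a sorted list with binary search instead of a Burst Trie, and that the terms and numbers are invented for the example.

```python
from bisect import bisect_left
from typing import List, Tuple

# Terms in alphabetical order, each with a simple numeric value
# (here a hypothetical document frequency).
TERMS: List[Tuple[str, int]] = sorted([
    ("meresco", 12), ("merge", 7), ("metadata", 30), ("museum", 4),
])
KEYS: List[str] = [term for term, _ in TERMS]


def prefix_query(prefix: str, limit: int = 10) -> List[Tuple[str, int]]:
    """Return up to `limit` (term, value) pairs whose term starts with prefix."""
    start = bisect_left(KEYS, prefix)  # first term >= prefix
    matches: List[Tuple[str, int]] = []
    for term, value in TERMS[start:]:
        if not term.startswith(prefix) or len(matches) == limit:
            break
        matches.append((term, value))
    return matches


print(prefix_query("mer"))
```

Because the terms are sorted, all terms sharing a prefix sit next to each other, which is what makes auto-complete cheap.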

Triple Store
This index supports queries about arbitrary relationships between objects (graph-inference) typically through SPARQL or extensions to CQL. It is implemented with rdflib and OWLIM, the former being simple, the latter being one of the most scalable and fast triple stores around. An application is relating traditional records to social metadata such as tagging, ratings and reviews. A lot is going to happen around here.

Range Index
This index supports ultra fast retrieval of data contained in numerical ranges. It supports range queries such as 20090101 < date <= 20101231. Meresco has its own optimized implementation. This index is so small, it scales to billions of documents even on a single node.
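The range query from the example can be sketched with two binary searches over sorted values. This is only an illustration of the query semantics, not Meresco's optimized implementation; the entries are made up.

```python
from bisect import bisect_right
from typing import List, Tuple

# (date, identifier) pairs kept sorted by date; dates are YYYYMMDD integers.
ENTRIES: List[Tuple[int, str]] = sorted([
    (20081231, "doc1"), (20090101, "doc2"), (20090102, "doc3"),
    (20101231, "doc4"), (20110101, "doc5"),
])
DATES: List[int] = [date for date, _ in ENTRIES]


def range_query(low: int, high: int) -> List[str]:
    """Identifiers with low < date <= high."""
    start = bisect_right(DATES, low)   # first entry strictly above low
    stop = bisect_right(DATES, high)   # one past the last entry <= high
    return [doc for _, doc in ENTRIES[start:stop]]


print(range_query(20090101, 20101231))
```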

N-gram
The n-gram index performs approximate matches and is hence used for suggestions in ‘Did you mean?’-like solutions. More generally, it allows for language-neutral queries. This index lies on top of the Lucene index, but is slated to be replaced by a faster and more specialized one in 2010.
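To show why n-grams enable approximate matching, the sketch below compares words by the trigrams they share. The vocabulary, the choice of trigrams and the Jaccard similarity measure are assumptions for this example; the text does not say how Meresco scores candidates.

```python
from typing import List, Set

VOCABULARY: List[str] = ["meresco", "metadata", "museum", "library"]


def ngrams(word: str, n: int = 3) -> Set[str]:
    """Split a word into overlapping character n-grams."""
    padded = f"${word}$"  # '$' marks the word boundaries
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the n-gram sets of two words."""
    grams_a, grams_b = ngrams(a), ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def did_you_mean(word: str) -> str:
    """Suggest the vocabulary term that shares the most n-grams."""
    return max(VOCABULARY, key=lambda candidate: similarity(word, candidate))


print(did_you_mean("libary"))
```

Nothing here depends on the language of the words, which is what makes the approach language neutral.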

Meresco can maintain these index types in sync both during batch and real-time updates. Together, these indexes deliver fast results to queries, even if those queries are complicated and demanding such as tag clouds, auto-complete, clustering, term suggestions, did-you-mean and relationship queries.

Weightless

The high concurrent performance of Meresco is not achieved by deploying an army of processes and threads but by the asynchronous power of Weightless.

Server processes are either synchronous or asynchronous.

Synchronous

Synchronous servers accept a connection and wait for the whole request to be received before processing. Every connection is handled by a single thread or process. The program flow in synchronous servers seems conceptually simpler, but in practice it often gets complicated by all kinds of locking issues.

Asynchronous

Asynchronous servers read when there is data available and send responses when there is something to send, all in one single-threaded process. Code that runs within an asynchronous server needs to be crafted with special care to allow for fair resource sharing between requests. This is known as cooperative scheduling. Because there are no threads that interrupt and lock each other (which simplifies the software compared to the alternative), the server is able to handle a large number of concurrent connections without a noticeable speed penalty.
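Cooperative scheduling can be illustrated with plain Python generators. The round-robin scheduler below is a toy and not part of any framework: each handler does a small piece of work and then yields, so that other handlers get their turn in the same single thread.

```python
from collections import deque
from typing import Deque, Generator, List

log: List[str] = []


def handler(name: str, steps: int) -> Generator[None, None, None]:
    """A request handler that voluntarily yields after every step."""
    for step in range(steps):
        log.append(f"{name}:{step}")
        yield  # hand control back to the scheduler


def run(tasks: List[Generator[None, None, None]]) -> None:
    """Round-robin over all handlers until each one has finished."""
    queue: Deque[Generator[None, None, None]] = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # handler finished; do not reschedule it
        queue.append(task)


run([handler("a", 2), handler("b", 3)])
print(log)
```

A handler that never yields would starve all others, which is why such code has to be written with care.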

Weightless

Weightless was developed to bring the advantages of asynchronous I/O to Meresco. Weightless is a lightweight framework that provides the infrastructure for asynchronous servers. Weightless comes with HTTP and HTTPS server functionality. By making use of Python generators (co-routines) to facilitate input and output, Weightless provides an easy to use mechanism to read and write data.
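The following sketch shows how a generator can read input incrementally. It deliberately uses only plain Python and not the Weightless API, whose actual interfaces are not described here; the function name and the protocol handling are assumptions for illustration.

```python
from typing import Generator, Optional


def request_method() -> Generator[Optional[str], bytes, None]:
    """Collect bytes until a full request line arrived, then yield its method."""
    buffer = b""
    while b"\r\n" not in buffer:
        chunk = yield None  # suspended here until more data is sent in
        buffer += chunk
    line = buffer.split(b"\r\n", 1)[0].decode("ascii")
    yield line.split(" ")[0]


reader = request_method()
next(reader)                             # run up to the first yield
partial = reader.send(b"GET /sru ")      # not enough data yet
method = reader.send(b"HTTP/1.0\r\n")    # the request line is complete
print(partial, method)
```

The generator keeps its own state between chunks, so the code reads like a sequential parser although the data arrives in pieces.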

More information on Weightless can be found at: http://weightless.io

Storage versus Index

In a lot of search engines the data and indices are stored together, creating a single huge entity. This approach potentially leads to a number of problems, ranging from backup problems to performance issues. Also, with these systems, access to the data is limited to what is offered through their respective APIs.

Index
Meresco works differently, using an index for what it is designed to do best: to return the identifiers of documents that match a query. The index of a book gives you the number of the page that covers a certain topic.
Similarly, the identifiers returned by a Meresco index point to documents in a separate storage. This leads to a simple index that even for millions of documents typically stays small enough to fit in memory entirely. This yields an obvious speed advantage.
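The division of labour can be sketched in a few lines of Python. The toy inverted index below stores only identifiers, while the documents live in a separate mapping standing in for the storage; all names and data are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Separate storage: identifier -> full document.
STORAGE: Dict[str, str] = {
    "doc1": "Meresco keeps the index small",
    "doc2": "The storage holds the documents",
}

# The index maps words to identifiers only, never to document content.
index: Dict[str, Set[str]] = defaultdict(set)
for identifier, text in STORAGE.items():
    for word in text.lower().split():
        index[word].add(identifier)


def search(word: str) -> Set[str]:
    """Return the identifiers of the documents that contain the word."""
    return set(index.get(word.lower(), set()))


hits = search("index")
documents: List[str] = [STORAGE[identifier] for identifier in sorted(hits)]
print(hits, documents)
```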

Storage
The data is stored in a Meresco Storage. The storage is basically a well defined directory structure. Identifiers are used to pinpoint a directory in which the data is stored. This means that it can be stored on basically any filesystem (although e.g. the ext2/3 filesystems impose a limit on the number of subdirectories in a directory).
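A directory-based storage could look roughly like the sketch below. The layout (one escaped directory per identifier, one file per named part) is a hypothetical choice for illustration; the actual directory structure of Meresco Storage is not specified here.

```python
import os
import tempfile
from urllib.parse import quote


def directory_for(base: str, identifier: str) -> str:
    """Map an identifier to a directory (hypothetical layout)."""
    # Escape characters such as ':' and '/' so the name is filesystem safe.
    return os.path.join(base, quote(identifier, safe=""))


def put(base: str, identifier: str, partname: str, data: bytes) -> None:
    """Store one part of a document in its native format."""
    directory = directory_for(base, identifier)
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, partname), "wb") as stream:
        stream.write(data)


def get(base: str, identifier: str, partname: str) -> bytes:
    """Read a part back exactly as it was written."""
    with open(os.path.join(directory_for(base, identifier), partname), "rb") as stream:
        return stream.read()


with tempfile.TemporaryDirectory() as base:
    put(base, "oai:example.org:1", "oai_dc", b"<record/>")
    stored = get(base, "oai:example.org:1", "oai_dc")

print(stored)
```

Since the data are ordinary files, background tools and the operating system's disk cache can work on them directly.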

Native
Having all data in native format on disk makes it easier to control and maintain. Data can be read immediately without having to be decoded or transformed in any other way. Data enrichment tools, for example to get metadata from PDF files or digital images, can do their work in the background directly on the data files.

Caching
Many systems come with their own caching mechanisms. Meresco Storage however takes advantage of the disk caching capabilities of modern unix systems. This results in fast data lookups with no added complexity.

Conclusion
By keeping only identifiers, a Meresco index stays simple, small and fast. The accompanying storage offers fast retrieval of stored documents in their native formats.

What to do with Linked Data?

The web is moving towards linked data. Many data collections are available as Linked Data, including Dutch scientific libraries, museums and archives. What can we do with all this data? What tools do we need? The good news is that Linked Data can be adopted incrementally.

What problem does Linked Data address?

Objects in libraries, museums and archives are increasingly described by experts not related to these institutions. The resulting descriptions relate to persons, places and other concepts, over which none of the experts or institutions can claim authority. Institutions are no longer the only authority on specific data collections and certainly not authoritative on all the concepts their collection relates to. Maintaining collections of authoritative information is becoming increasingly difficult. Life cycle management of metadata records (possibly even maintaining different versions for different communities) becomes a major challenge. Failing to maintain a clearly authoritative, and not isolated, collection undermines the existence of museums, archives and libraries and any other (broker) service that is to add value somehow.

How does Linked Data solve the problem?

Linked data allows everyone to make statements about everything [RDF Concepts]. It does not encapsulate knowledge about objects in a record, but represents the knowledge as a set of statements about the object. The record becomes a set of statements. This is simple, but fundamental. Consider the following record about a certain piece of art:

Record 5832:
    identifier = http://institute.org/987639
    title = "A true work of Art"
    creator = "V. van Gogh"

This could be represented with (at least) two statements:

http://institute.org/5832 has a title whose value is “A true work of Art”

http://institute.org/5832 has a creator whose value is “V. van Gogh”

The fundamental change here is that record 5832 no longer plays a role when exchanging data. The record was artificially created to describe an object, but the record itself is not important; only the statements it introduces are. (Linked Data can transparently introduce intermediate objects to group statements; however, these are not manageable items ‘to worry about’ as records were.) Only these sets of statements are exchanged. Maintaining an authoritative collection comes down to carefully selecting sets of statements to join.
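Turning a record into statements is mechanical, as the following sketch shows. The use of Dublin Core element URIs as predicates is an assumption made for this example; the text does not prescribe a vocabulary.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Assumption: predicates are taken from the Dublin Core element set.
DC = "http://purl.org/dc/elements/1.1/"


def to_statements(subject: str, fields: Dict[str, str]) -> List[Triple]:
    """Turn every field of a record into one statement about the subject."""
    return [(subject, DC + name, value) for name, value in fields.items()]


statements = to_statements(
    "http://institute.org/5832",
    {"title": "A true work of Art", "creator": "V. van Gogh"},
)
for statement in statements:
    print(statement)
```

After the conversion there is no record left to exchange, only a list of independent statements.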

Resolving Statements

Formally, a statement is a triple of (Subject, Predicate, Object) [RDF Concepts]. Together, triples form graphs.

Subject and Predicate are always URIs, while Object can be a URI or a value. In the example above the Creator statement could have been:

http://institute.org/5832 has a creator whose URI is info:eu-repo/dai/nl/071792279

For the actual name of this person, we will have to look for statements saying something about info:eu-repo/dai/nl/071792279 as the subject. This again might resolve to a URI so we have to repeat the process until we find a value.
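The resolving process can be sketched as a small loop. The in-memory list of triples, the short predicate names and the name attached to the DAI are hypothetical and serve only to illustrate the repeated lookup.

```python
from typing import List, Optional, Tuple


class URI(str):
    """Marks an object that is a reference rather than a literal value."""


Triple = Tuple[URI, str, str]

# Hypothetical statements gathered from two different sources.
TRIPLES: List[Triple] = [
    (URI("http://institute.org/5832"), "creator", URI("info:eu-repo/dai/nl/071792279")),
    (URI("info:eu-repo/dai/nl/071792279"), "name", "V. van Gogh"),
]


def objects(subject: str, predicate: str) -> List[str]:
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def resolve(node: str, predicate: str = "name") -> Optional[str]:
    """Follow URIs until a literal value is found; None if unresolved."""
    seen = set()
    while isinstance(node, URI):
        if node in seen:  # guard against circular references
            return None
        seen.add(node)
        found = objects(node, predicate)
        if not found:
            return None
        node = found[0]
    return node


creator = objects("http://institute.org/5832", "creator")[0]
print(resolve(creator))
```

A statement that cannot be resolved locally yields None here; such statements are candidates for being shown as links instead of values.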

How can institutions take advantage of Linked Data?

If the world around an institution is a cloud of Linked Data sources, the center of the cloud is where the institution has most of its authority. Surrounding this authority center are the related data sources on which the institution has less authority. Together we call this the authority cloud.

With this as a reference, take the following small steps:

  1. Start seeing data collections as statements, both in your Authority Cloud and outside it. Do not worry when they are not in RDF; that is not required.
  2. Start with using global persistent identifiers for all your objects. This allows you and others to make statements about the objects and to have meaningful joins.
  3. Start gathering triples from the sources within your Authority Cloud in a Triple Store. When sources are not in RDF just use simple tools to extract triples.
  4. Populate your local services using the Triple Store to resolve others’ statements. For example, while indexing your own metadata, use the triple store to create additional search fields, facets, tag clouds etc.
  5. While displaying objects, turn unresolved statements into click-able links.
  6. For advanced users: start making use of the Triple Store’s query capabilities for enhancing your services.
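Steps 3 and 4 can be sketched with very simple tools. In the example below, a CSV export stands in for a non-RDF source and a plain list stands in for the Triple Store; the column names, the data and the naming of the extra search field are all assumptions.

```python
import csv
import io
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical non-RDF source: a CSV export with one row per person.
CSV_SOURCE = "uri,name\ninfo:eu-repo/dai/nl/071792279,V. van Gogh\n"


def extract_triples(text: str) -> List[Triple]:
    """Step 3: turn each CSV cell into a (subject, predicate, object) triple."""
    triples: List[Triple] = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = row.pop("uri")
        triples.extend((subject, column, value) for column, value in row.items())
    return triples


def enrich(fields: Dict[str, str], triples: List[Triple]) -> Dict[str, str]:
    """Step 4: add a search field for every statement about a referenced URI."""
    enriched = dict(fields)
    for name, value in fields.items():
        for subject, predicate, obj in triples:
            if subject == value:
                enriched[f"{name}.{predicate}"] = obj
    return enriched


store = extract_triples(CSV_SOURCE)
document = enrich({"creator": "info:eu-repo/dai/nl/071792279"}, store)
print(document)
```

The enriched document can then be indexed as usual, so that a search for the name finds the object even though the original metadata only contained a URI.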

What tools are needed to deal with Linked Data?

Keep your tools! Unless you are dissatisfied, of course, retain your investment. You will, however, need a scalable triple store in your own data center. Since this Triple Store contains all the statements you decided need resolving before offering your service, it must be fast and readily available.

In the next installment of this blog, we will outline how Meresco can be used to implement Linked Data.

Inbox component

For a long time, the only means to insert records into a Meresco index was by harvesting them from an OAI repository. Over time, the need arose to insert records from non-OAI sources. This has been accomplished with the ‘Inbox’ component.

Several Meresco implementers already had their own database without an OAI repository interface. Moreover, it turned out to be impossible to add OAI interfaces to these systems; some simply did not provide the technical means necessary to construct such an interface. Most systems do provide a means to export their data to a file format, either as one large file or as several smaller files. This gives Meresco an opportunity to index these records.

Implementation

The ‘Inbox’ component monitors a directory for file activity. Every file is read and its content is inserted into the application DNA of the server. By adding format-specific components as observers of the inbox component, virtually any data format can be used and indexed. For example, using the standard Meresco XSLT crosswalk mechanism, a custom XML format can be converted to e.g. OAI Dublin Core or MODS.
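The observer idea can be illustrated with a small polling loop. This is a sketch, not the Meresco Inbox component: the class, its methods and the polling approach are assumptions, and a real implementation may well rely on file system notifications instead.

```python
import os
import tempfile
from typing import Callable, List, Set, Tuple

Observer = Callable[[str, bytes], None]


class Inbox:
    """Watches a directory and hands new files to its observers."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self.observers: List[Observer] = []
        self._seen: Set[str] = set()

    def add_observer(self, observer: Observer) -> None:
        self.observers.append(observer)

    def poll(self) -> int:
        """Process files not seen before; return how many were handled."""
        handled = 0
        for name in sorted(os.listdir(self.directory)):
            if name in self._seen:
                continue
            with open(os.path.join(self.directory, name), "rb") as stream:
                data = stream.read()
            for observer in self.observers:
                observer(name, data)  # e.g. a format-specific converter
            self._seen.add(name)
            handled += 1
        return handled


received: List[Tuple[str, bytes]] = []
with tempfile.TemporaryDirectory() as directory:
    inbox = Inbox(directory)
    inbox.add_observer(lambda name, data: received.append((name, data)))
    with open(os.path.join(directory, "record1.xml"), "wb") as stream:
        stream.write(b"<record/>")
    first = inbox.poll()
    second = inbox.poll()

print(first, second, received)
```

Supporting another data format then only requires registering another observer; the inbox itself stays unchanged.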

Recent use

Recently the inbox was implemented in the TU Delft Library Discover Project as a means to update records selectively. Use of the new search engine has uncovered several mistakes made in the catalogue over the years, and these are now being corrected. Once a mistake has been corrected, the record is exported into the inbox and thereby automatically reindexed.

Dependable OAI Repositories

With the rising popularity of Open Access, organizations expect their OAI repositories to be highly dependable. The repository must be able to deal with millions of records and respond quickly to frequent requests from Service Providers.

The Meresco community followed these developments by continuously improving Meresco’s OAI components. During this process, compliance with the OAI-PMH specification grew to nearly 100% and new specialized indexes were added to keep query response times well under one second.

History

Back in 2007, the first OAI-PMH repository components were implemented in the LOREnet project. The 16 components were reduced to 8 in the OpenER project for the Open University. These 8 components still exist, but some of them were significantly refactored to keep up with load and volume requirements. At the end of 2008, Berkeley DB replaced Lucene, which made the repository respond much faster to requests with from and until parameters. In 2009, the huge number of sets in the LOREnet project required an even more specialized index to maintain query response times.

Present situation

Today, several multi-million repositories are in use by, among others, Sound & Vision (Beeld en Geluid) and the University of Tilburg (UvT). These two are examples of stand-alone repository implementations. LOREnet and EduRep are examples of repositories integrated in, respectively, a portal and a search engine.

Indexes and Storage

Initially, creating a repository was straightforward using Meresco’s existing storage and Lucene index components. The new specialized indexes for OAI were also made available as reusable components. This extends the range of available indexes, which are now: Full text (Lucene), Facets, Range and Dictionary (Berkeley DB and Burst Trie).

Repositories, Search Engines and Archives

Using the available index and storage components, a repository is just as easily created as a Search Engine or a complete Archive. After all, these are quite similar things. Any repository needs a storage, but also an index for maintaining it. Similarly, every search engine needs an index, but also a storage to obtain the result records from. And an archive is yet another combination of storage and index, but with different intentions.

Wikiwijs, EduRep and Meresco

The Wikiwijs website was launched on December 14 by the Dutch minister of education as an open environment in which every teacher can find, use and adapt learning materials for any educational level. Wikiwijs search is powered by KennisNet’s EduRep platform, which is based on Meresco.

Wikiwijs Search

The search part of Wikiwijs connects to EduRep to carry out search queries entered by the users of wikiwijs.nl. EduRep returns IEEE LOM records and Wikiwijs in turn displays these, nicely formatted, on the result page. Wikiwijs offers further refinement options by presenting selectors for the desired educational level and target audience.

Educational level

Users interested in certain educational levels, say MBO, can select these in a box beside the search results and press the button to update the result. We searched for luchtkwaliteit and the system responded with 263 results. We then selected MBO and the results were narrowed down to 193 results on MBO level.

Use of faceted search

If we select PO, SBaO and SO instead of MBO, the system responds with: no results found and suggests removing filters, so that “hopefully Wikiwijs will find more”. It would be preferable if Wikiwijs hid selectors that do not yield any results. The underlying search interface of EduRep gives this information, as can be seen in this query: SRU Query. It requests a facet on the field lom.educational.context.value, which tells us the number of matching records for each of the educational levels. We hope that a future version of Wikiwijs will take advantage of this feature.

Technology

Wikiwijs connects to EduRep using the SRU protocol, a standard from The Library Of Congress. The SRU service is implemented with Meresco, which adds the extension parameter x-term-drilldown. This extension allows a client to request faceting on a given metadata field. For more information see Meresco Public Interfacs.odt in the Technical Documentation.

The response to the SRU query above would contain the following XML:

<dd:drilldown xsi:schemaLocation="…">
  <dd:term-drilldown>
    <dd:navigator name="lom.educational.context.value">
      <dd:item count="193">BVE</dd:item>
      <dd:item count="4">VO</dd:item>
      <dd:item count="3">HBO</dd:item>
    </dd:navigator>
  </dd:term-drilldown>
</dd:drilldown>
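
A client could use these counts to hide selectors that would yield no results. The sketch below parses a response of this shape with Python's standard library. The namespace URI bound to the dd prefix is an assumption (the fragment above does not show it), and the list of selectors is only an example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

DD = "http://example.org/namespace/drilldown"  # assumed namespace URI

RESPONSE = f"""
<dd:drilldown xmlns:dd="{DD}">
  <dd:term-drilldown>
    <dd:navigator name="lom.educational.context.value">
      <dd:item count="193">BVE</dd:item>
      <dd:item count="4">VO</dd:item>
      <dd:item count="3">HBO</dd:item>
    </dd:navigator>
  </dd:term-drilldown>
</dd:drilldown>
"""


def facet_counts(xml_text: str, field: str) -> Dict[str, int]:
    """Extract the count per term for one drilldown field."""
    root = ET.fromstring(xml_text.strip())
    counts: Dict[str, int] = {}
    for navigator in root.iter(f"{{{DD}}}navigator"):
        if navigator.get("name") != field:
            continue
        for item in navigator.findall(f"{{{DD}}}item"):
            counts[item.text] = int(item.get("count"))
    return counts


def visible_selectors(levels: List[str], counts: Dict[str, int]) -> List[str]:
    """Keep only the selectors that would yield at least one result."""
    return [level for level in levels if counts.get(level, 0) > 0]


counts = facet_counts(RESPONSE, "lom.educational.context.value")
print(visible_selectors(["PO", "VO", "BVE", "HBO"], counts))
```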