Weightless

Meresco achieves its high concurrent performance not by deploying an army of processes and threads, but through the asynchronous power of Weightless.

Server processes are either synchronous or asynchronous.

Synchronous

Synchronous servers accept a connection and wait for the whole request to be received before processing. Every connection is handled by a single thread or process. The program flow in synchronous servers seems conceptually easier, but in practice it often gets complicated by all kinds of locking issues.

Asynchronous

Asynchronous servers read when there is data available and send responses when there is something to send, all in one single-threaded process. Code that runs within an asynchronous server needs to be crafted with special care to allow for fair resource sharing between requests. This is known as cooperative scheduling. Because there are no threads that interrupt and lock each other, the software is simpler than the threaded alternative, and the server is able to handle a large number of concurrent connections without a noticeable speed penalty.
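
The idea of cooperative scheduling can be illustrated with plain Python generators. The following is a minimal sketch, not Meresco or Weightless code: the names handle_request and run_round_robin are hypothetical, and the "work" is just appending to a list. Each handler does a small amount of work and then yields, so that a single thread can interleave several requests.

```python
from collections import deque

def handle_request(name, steps, log):
    """Hypothetical request handler: does a little work, then yields control."""
    for step in range(steps):
        log.append((name, step))   # one small unit of work
        yield                      # give other requests a turn

def run_round_robin(tasks):
    """Run generator-based tasks in one thread, one step at a time each."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)             # resume the task until its next yield
        except StopIteration:
            continue               # task finished; do not reschedule it
        queue.append(task)         # not finished: back of the queue

log = []
run_round_robin([handle_request("a", 2, log), handle_request("b", 3, log)])
```

After the run, log shows the steps of both requests interleaved. A handler that never yields would starve the others, which is why such code has to be written with care.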

Weightless

Weightless was developed to bring the advantages of asynchronous I/O to Meresco. Weightless is a lightweight framework that provides the infrastructure for asynchronous servers. Weightless comes with HTTP and HTTPS server functionality. By making use of Python generators (coroutines) to facilitate input and output, Weightless provides an easy-to-use mechanism to read and write data.
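
To give a feeling for how a generator can act as a coroutine that reads data, here is a small sketch in plain Python. It does not use the Weightless API; read_request_line and feed are hypothetical names, and feed merely stands in for an event loop that delivers data as it arrives.

```python
def read_request_line():
    """Hypothetical coroutine: collects chunks until a full line has arrived."""
    buffer = ""
    while "\r\n" not in buffer:
        chunk = yield            # suspend until more data is sent in
        buffer += chunk
    line, _, _rest = buffer.partition("\r\n")
    return line

def feed(coroutine, chunks):
    """Stand-in for an event loop: push chunks in as they 'arrive'."""
    next(coroutine)              # advance to the first yield
    for chunk in chunks:
        try:
            coroutine.send(chunk)
        except StopIteration as done:
            return done.value    # the generator's return value
    return None                  # ran out of data before a full line arrived

request_line = feed(read_request_line(), ["GET /in", "dex HT", "TP/1.0\r\nHost"])
```

The reader is written as straight-line code, yet it never blocks: between two chunks it is simply suspended, and the process is free to serve other connections.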

More information on Weightless can be found at: http://weightless.io

Storage versus Index

In a lot of search engines the data and indices are stored together, creating a single huge entity. This approach potentially leads to a number of problems, ranging from backup problems to performance issues. Also, with these systems, access to the data is limited by what is offered through their respective APIs.

Index
Meresco works differently, using an index for what it is designed to do best: to return the identifiers of documents that match a query. The index of a book gives you the number of the page that covers a certain topic.
Similarly, the identifiers returned by a Meresco index point to documents in a separate storage. The result is a simple index that, even for millions of documents, typically stays small enough to fit entirely in memory, which yields an obvious speed advantage.
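
The separation can be sketched in a few lines of Python. This is only an illustration under simple assumptions: TinyIndex is a hypothetical class, not part of Meresco, and a dictionary stands in for the separate storage. The index keeps nothing but identifiers per term, and the documents themselves are fetched elsewhere.

```python
from collections import defaultdict

class TinyIndex:
    """Hypothetical in-memory index: maps terms to document identifiers only."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, identifier, text):
        # Only the identifier is kept per term; the text itself is not stored.
        for term in text.lower().split():
            self._postings[term].add(identifier)

    def search(self, term):
        return sorted(self._postings.get(term.lower(), set()))

# A plain dict stands in for the separate storage in this sketch.
storage = {
    "doc:1": "Asynchronous servers in Python",
    "doc:2": "Python generators",
}

index = TinyIndex()
for identifier, text in storage.items():
    index.add(identifier, text)

hits = index.search("python")                           # identifiers only
documents = [storage[identifier] for identifier in hits]  # fetched separately
```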

Storage
The data is stored in a Meresco Storage. The storage is essentially a well-defined directory structure. Identifiers are used to pinpoint a directory in which the data is stored. This means that it can be stored on virtually any filesystem (although e.g. the ext2/3 filesystems impose a limit on the number of subdirectories in a directory).
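
How an identifier might pinpoint a directory can be sketched as follows. The layout used here is an assumption made for illustration, not the actual Meresco Storage layout: each ':'-separated part of the identifier becomes one directory level, and the functions directory_for, store and retrieve are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote

def directory_for(base, identifier):
    # Assumed layout: each ':'-separated part becomes one directory level.
    # Parts are percent-encoded so that e.g. '/' cannot split a level.
    # This sketch does not guard against parts such as '..'.
    parts = [quote(part, safe="") for part in identifier.split(":")]
    return Path(base).joinpath(*parts)

def store(base, identifier, partname, data):
    directory = directory_for(base, identifier)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / partname).write_text(data, encoding="utf-8")

def retrieve(base, identifier, partname):
    return (directory_for(base, identifier) / partname).read_text(encoding="utf-8")

with tempfile.TemporaryDirectory() as base:
    store(base, "oai:example.org:42", "oai_dc", "<dc>Example</dc>")
    path = directory_for(base, "oai:example.org:42").relative_to(base)
    content = retrieve(base, "oai:example.org:42", "oai_dc")
```

Spreading the parts of an identifier over several directory levels is one way to keep the number of subdirectories per directory down, which matters on filesystems with such a limit.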

Native
Having all data in native format on disk makes it easier to control and maintain. Data can be read immediately without having to be decoded or transformed in any other way. Data enrichment tools, for example to get metadata from PDF files or digital images, can do their work in the background directly on the data files.

Caching
Many systems come with their own caching mechanisms. Meresco Storage, however, takes advantage of the disk caching capabilities of modern Unix systems. This results in fast data lookups with no added complexity.

Conclusion
By keeping only identifiers, a Meresco index stays simple, small and fast. The accompanying storage offers fast retrieval of stored documents in their native formats.

Inbox component

For a long time the only means to insert records into a Meresco index was by harvesting them from an OAI repository. Over time a need arose to be able to insert records from non-OAI sources. This has been accomplished by making use of the ‘Inbox’ component.

Several Meresco implementers already had their own database without an OAI repository interface. Moreover, it turned out to be impossible to add OAI interfaces to these systems; some simply did not provide the technical means necessary to construct such an interface. Most systems, however, provide a means to export their data to files, either as one large file or as several smaller files. This gives Meresco an opportunity to index these records.

Implementation

The ‘Inbox’ component monitors a directory for file activity. Every file is read and the content is inserted into the application DNA of the server. By adding format-specific components as observers of the inbox component, virtually any data format can be used and indexed. For example, using the standard Meresco XSLT crosswalk mechanism, a custom XML format can be converted to e.g. OAI Dublin Core or MODS.
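
The mechanism can be sketched in plain Python. This is not the Meresco Inbox component or its API: InboxSketch, add_observer and poll are hypothetical names, and the sketch simply polls the directory instead of reacting to file system events. It shows the basic pattern of reading each new file once and handing its content to the registered observers.

```python
import tempfile
from pathlib import Path

class InboxSketch:
    """Hypothetical inbox: passes each new file's content to its observers."""

    def __init__(self, directory):
        self._directory = Path(directory)
        self._seen = set()       # names of files already processed
        self._observers = []

    def add_observer(self, observer):
        self._observers.append(observer)

    def poll(self):
        # Look for files not handled before and notify every observer.
        for path in sorted(self._directory.iterdir()):
            if path.is_file() and path.name not in self._seen:
                self._seen.add(path.name)
                data = path.read_text(encoding="utf-8")
                for observer in self._observers:
                    observer(path.name, data)

received = []
with tempfile.TemporaryDirectory() as directory:
    inbox = InboxSketch(directory)
    inbox.add_observer(lambda name, data: received.append((name, data)))
    (Path(directory) / "record1.xml").write_text("<record>1</record>", encoding="utf-8")
    inbox.poll()
    inbox.poll()                 # a second poll finds nothing new
```

In this sketch an observer is any callable; a format-specific observer could, for instance, convert the incoming XML before passing it on for indexing.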

Recent use

Recently the inbox was implemented in the TU Delft Library Discover Project as a means to update records selectively. The usage of the new search engine has uncovered several mistakes made over the years in the catalogue and these are now being corrected. Once a mistake has been corrected, the record is exported to the inbox and thereby automatically reindexed.