Labeled Interface Invocation

Meresco’s Latest Addition
Meresco has received an important upgrade that greatly improves the flexibility of DNA configuration. With Labeled Interface Invocation (LII), supporting different implementations of the same interface has become even more flexible.

Multiple Implementations For an Interface
DNA distinguishes between interfaces and implementations, which is a well-known practice. Sometimes, however, there are multiple distinct implementations of the same interface. For example, two Storage components might support the same interface yet represent distinct storages. They can have distinct performance properties or hold distinct data sets. DNA already supports this by allowing multiple implementations for any interface. At configuration time, one branch might use one implementation while another branch uses another. For a step-by-step programmer's guide to how this works, see Component Configuration with DNA.

Components Choose Their Own Implementations
Labeled Interface Invocation allows components to choose between distinct implementations of the same interface at run-time. This is done by first labeling the implementations during configuration, and then using these labels when invoking the interface.

Suppose we have two storages supporting the interface

    addNew(data)

Normally, a message to an arbitrary implementation of this interface would look like:

    self.any.addNew(someData)

Now suppose that there are two implementations, differing in the way they store and safeguard data. Let's label these implementations “reviews” and “audits”:

    reviewStore = Storage(name="reviews")
    auditStore = Storage(name="audits")

With Labeled Interface Invocation we can now target messages to either of these implementations as follows:

    self.any["reviews"].addNew(newReview)
    self.any["audits"].addNew(auditRecord)

An interesting aspect of the lines above is that the constant labels “reviews” and “audits” can in fact be variables. Thus, labels can also be computed and routing messages to different components can be done on the fly. For example:

    target = ... compute or lookup a label ...
    self.any[target].addNew(someData)
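
To make the idea concrete, here is a minimal, self-contained sketch of label-based dispatch. It is not Meresco's implementation: the `Observable`, `_AnyDispatcher`, `Intake` and in-memory `Storage` classes below are hypothetical stand-ins, written only to illustrate how `self.any[label]` could narrow the candidates to the components configured under that name, and how the label can be computed at run-time.

```python
class Storage(object):
    """Hypothetical in-memory stand-in for a storage component."""
    def __init__(self, name):
        self.name = name
        self.items = []

    def addNew(self, data):
        self.items.append(data)
        return self.name


class _AnyDispatcher(object):
    """Hands out the method of the first candidate that implements a message."""
    def __init__(self, observers):
        self._observers = observers

    def __getitem__(self, label):
        # Narrow the candidates to the observers carrying this label.
        labeled = [o for o in self._observers
                   if getattr(o, 'name', None) == label]
        return _AnyDispatcher(labeled)

    def __getattr__(self, message):
        # Only called for names not found normally, i.e. for messages.
        for observer in self._observers:
            method = getattr(observer, message, None)
            if callable(method):
                return method
        raise AttributeError("no observer responds to %r" % message)


class Observable(object):
    """Assumed base class: keeps a list of observers and exposes 'any'."""
    def __init__(self):
        self._observers = []

    def addObserver(self, observer):
        self._observers.append(observer)

    @property
    def any(self):
        return _AnyDispatcher(self._observers)


class Intake(Observable):
    def handle(self, kind, data):
        # The label is computed at run-time instead of hard-coded.
        target = "audits" if kind == "audit" else "reviews"
        return self.any[target].addNew(data)


reviewStore = Storage(name="reviews")
auditStore = Storage(name="audits")
intake = Intake()
intake.addObserver(reviewStore)
intake.addObserver(auditStore)
intake.handle("audit", "record-1")
intake.handle("review", "review-1")
```

In this sketch an unlabeled `intake.any.addNew(...)` still goes to the first observer that implements `addNew`, while the labeled form only considers observers whose `name` matches.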

Conclusion
Meresco’s DNA has been stable for about two years, so new DNA features are rare, and we were pleasantly surprised to have found one. As usual, it took a long time and a lot of thinking, but only a few lines of code: three, in fact. Most code changes are for tests and for other components to use LII. Here is the complete LII changeset.

Do you have questions about this feature, or would you like support in using it yourself? Please contact me.

Open Space MERESCO meeting June 25th 2010

The Dutch version follows below.

On June 25th SURFfoundation and Seek You Too will organize a MERESCO meeting in Utrecht. The meeting is open to everyone who is interested.

Last year there was a conference at which we explained the uses of MERESCO, and there was time to discuss matters concerning MERESCO. There were also some requests for subjects for the next meeting; those subjects will be covered during the upcoming meeting.

This meeting will be an Open Space meeting, so it will be possible to raise new subjects with the other visitors, or with the experts who will also attend.

We will finish the meeting with some refreshments.

Dutch version (translated)

SURFfoundation and Seek You Too will organize a MERESCO meeting on June 25th at SURFfoundation in Utrecht. It will be an open meeting; everyone who is interested is welcome.

MERESCO stands for ‘MEtadata based REpository Search Components in Open source’. It is an open source platform developed by Seek You Too (CQ2) in various projects of SURFfoundation, SURFnet and Kennisnet.

During last year's conference the applications of MERESCO were explained and there was time to exchange knowledge. Requests were also made at that time for subjects for the next meeting. These subjects will be discussed during this meeting.

This year's meeting will be an Open Space meeting in which you can again raise new subjects yourself. Experts will be present who will actively take part in the conversations and can answer any questions.

We will close with snacks and drinks.

How to scale up Meresco

Recently Kennisnet asked me how to scale up Edurep with regard to:
- queries per second
- record updates per second
- total number of records

I suspect that this is of broader interest, so below are two approaches for scaling CPUs, memory or bandwidth.

Queries per second
A single machine Meresco system runs between 10 and 100 queries per second. Scaling this requires adding more machines so load can be distributed over CPUs and networks. There are two approaches.

Approach A
Replicate the entire server process and feed updates to them simultaneously.
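
As a rough illustration of approach A, the sketch below replicates a toy server and puts a small dispatcher in front of the replicas. Everything in it is an assumption made for the example: `Server` and `ReplicatedCluster` are hypothetical names, the "index" is a plain dictionary, and round-robin is just one possible way to spread queries. It only shows the two rules of the approach: every update goes to all replicas, and each query goes to one of them.

```python
import itertools


class Server(object):
    """Hypothetical stand-in for one complete server process."""
    def __init__(self, name):
        self.name = name
        self.records = {}

    def update(self, identifier, record):
        self.records[identifier] = record

    def query(self, term):
        # Naive substring search; returns matching identifiers in order.
        return sorted(i for i, r in self.records.items() if term in r)


class ReplicatedCluster(object):
    def __init__(self, servers):
        self._servers = list(servers)
        self._next = itertools.cycle(self._servers)

    def update(self, identifier, record):
        # Every replica receives every update, so all replicas stay in sync.
        for server in self._servers:
            server.update(identifier, record)

    def query(self, term):
        # One replica answers each query; round-robin spreads the load.
        server = next(self._next)
        return server.name, server.query(term)


cluster = ReplicatedCluster([Server("node-1"), Server("node-2")])
cluster.update("rec:1", "scaling meresco")
cluster.update("rec:2", "labeled invocation")
first = cluster.query("meresco")
second = cluster.query("meresco")
```

Because each replica holds the complete data set, this raises the number of queries per second but does nothing for the update rate: every machine still processes every update.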

Approach B
Extract the most demanding components from the server’s configuration and put these on separate machines. Reconnect them using the Inbox component.

 
[Figure: server configuration, before and after]

Both approaches are based on standard Meresco functionality and therefore easily configured.

Record updates per second
Meresco is able to process 1 to 10 updates per second concurrently with querying. Scaling this up requires adding machines that can share the load of processing the records using approach B. These machines can feed into one or more query processing machines, effectively enabling scaling along both axes.
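
The decomposition can be sketched in a few lines. In the example below, threads stand in for separate processing machines and a plain in-process queue stands in for the connection between them; it is not Meresco's Inbox component, and `process`, `QueryNode` and `worker` are hypothetical names chosen for the illustration. The expensive per-record work is shared by the workers, and each finished record is fed into every query node.

```python
import queue
import threading


def process(raw):
    """Stand-in for the expensive per-record work (parsing, normalizing)."""
    identifier, text = raw
    return identifier, text.strip().lower()


class QueryNode(object):
    """Hypothetical query machine holding a ready-to-search index."""
    def __init__(self):
        self.index = {}
        self._lock = threading.Lock()

    def add(self, identifier, document):
        with self._lock:
            self.index[identifier] = document


def worker(inbox, queryNodes):
    while True:
        raw = inbox.get()
        if raw is None:  # sentinel: no more work for this worker
            break
        identifier, document = process(raw)
        # Feed the processed record into every query node.
        for node in queryNodes:
            node.add(identifier, document)


inbox = queue.Queue()
queryNodes = [QueryNode(), QueryNode()]
workers = [threading.Thread(target=worker, args=(inbox, queryNodes))
           for _ in range(3)]
for w in workers:
    w.start()
for n in range(10):
    inbox.put(("rec:%d" % n, "  Record %d  " % n))
for _ in workers:
    inbox.put(None)  # one sentinel per worker
for w in workers:
    w.join()
```

Adding workers raises the update rate, and adding query nodes raises the query rate, which is the two-axis scaling described above.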

The main idea is to decompose a system into subsystems that can be distributed and replicated. This analysis must be done before a system can scale up using cloud-like environments. How Meresco’s configuration supports this will be outlined in a future blog post.

Total number of records
Meresco can host 10 – 100 million records on one machine, mostly limited by what its indexes can do. Scaling up requires a closer look at these indexes to see how additional resources must be allocated. In this area Lucene, BerkeleyDB and OWLIM have earned great reputations. Meresco’s architecture helps to get the most out of these.

Meresco’s homegrown Facet Index and Sorted Dictionary Index (used for auto-complete) can be scaled following approach B. However, with a single-node limit of roughly one billion records most applications would not need more than one node.

Conclusion
I realize that I only scratched the surface of how to scale Meresco. There are many details to discuss and you probably wonder how your situation could be dealt with. I’d love to hear your responses!

Integrating Java in Python with JTool

MERESCO combines components written in various programming languages. It uses Python to tie these components together. It integrates Java using JTool.

It began with Lucene
Lucene is a well-known Java library for full-text search. MERESCO used PyLucene, which compiled Lucene to native machine code. PyLucene was unstable and did not cover all of Lucene. In 2008 PyLucene changed strategy and performance dropped significantly. We decided to try a completely different approach, and that turned out to work very well.

What did we try?
We quickly discovered that compiling Lucene with GCJ was easy and that it resulted in robust, fast and reliable programs. Then we created a Python extension called JTool which mirrors the complete Java API in Python.

How to use?
Here is how you use it in Python:

    $ python
    >>> import jtool
    >>> jtool.load('liblucene-core.so')  # compiled lucene-core.jar
    >>> from org.apache.lucene.index import IndexReader
    >>> reader = IndexReader.open("/indexdir")

This is how all of Lucene is accessed in MERESCO. It runs fast and reliably, with a low memory footprint. The code base of JTool is only 1500 lines; there is no code generation and it is completely generic. So the next question is:

Will JTool work for other Java libraries?
In February 2010 we started looking for a more scalable Triple Store for MERESCO. Our choice was OWLIM, which is written in Java. While Lucene is quite a large library, OWLIM is even larger. It depends on 22 other Java projects, including the Sesame RDF Framework.

Compilation of OWLIM took a bit more effort as we needed to gather all needed jar files and make sure some factories did not get duplicated in the final library. Then we tried to load this library in Python using JTool:

    >>> import jtool
    >>> jtool.load("libowlim-core.so")
    >>> from org.openrdf.repository.sail import SailRepository
    >>> from org.openrdf.query import QueryLanguage
    >>> ...

This enabled us to insert RDF and execute SPARQL queries on the triple store. Yes, it works!

Future of JTool
JTool cannot yet call methods with a NULL parameter or with Java 5 varargs. It also does not yet support callbacks in Python. We have solutions for these omissions, which we will implement this year. Meanwhile, it is easy enough to create a Java wrapper and use it via JTool. So JTool allows us to quickly integrate any Java library in MERESCO.

Availability
Sources for JTool up to version 4 are available at JTool Sources.
JTool versions 5 and up are available in binary form at JTool Binaries.