MERESCO

What happened in the MERESCO community?


Tag Archives: search

Open Index: how to find relevant indexes

Posted on July 29, 2011 by Erik J. Groeneveld

The Open Index consists of different, independent indexes. They all implement a simple query protocol that makes it possible to integrate their results into one final result list. This post outlines how the indexes relevant to a specific query are selected within an Open Index. Or rather how they could be selected, for this is just one way of doing it.

Meta Index

One of the key elements of the Open Index is the Meta Index (there may be one or more). It does not index documents; it indexes indexes. It records, among other things, which index uses which vocabularies. It also indexes the vocabularies themselves. Let me explain that.

Vocabulary indexing

When an index joins an Open Index, it registers itself with the Meta Index by telling the Meta Index its own location. It also tells the Meta Index which vocabularies it uses to index its documents. Many indexes use the Dublin Core vocabulary for indexing, but more specialized indexes use more specialized vocabularies such as FOAF (for social relationships), Geo (longitude and latitude info) or MusicBrainz (to describe music, performances, concerts). So now the Meta Index knows exactly which indexes specialize in which vocabularies.
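The registration step can be sketched in a few lines of Python. This is only an illustration: the class name `MetaIndex`, its methods and the example locations are assumptions made here, not part of the Open Index itself.

```python
from collections import defaultdict


class MetaIndex:
    """Hypothetical in-memory Meta Index: maps vocabularies to index locations."""

    def __init__(self):
        self._indexes_by_vocabulary = defaultdict(set)

    def register(self, location, vocabularies):
        # An index announces where it lives and which vocabularies it uses.
        for vocabulary in vocabularies:
            self._indexes_by_vocabulary[vocabulary].add(location)

    def indexes_for(self, vocabulary):
        # Sorted list of locations registered for this vocabulary.
        return sorted(self._indexes_by_vocabulary.get(vocabulary, ()))


meta = MetaIndex()
meta.register("http://example.org/library", ["dc"])
meta.register("http://example.org/music", ["dc", "musicbrainz"])
print(meta.indexes_for("musicbrainz"))  # ['http://example.org/music']
```

A real Meta Index would of course persist this information and accept registrations over the network; the sketch only shows the bookkeeping.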

For every vocabulary reference the Meta Index receives, it retrieves the descriptions of the vocabularies themselves (does this make it a meta-meta index then?). That means that it now also knows which fields each vocabulary contains. For example, the MusicBrainz vocabulary contains fields like Album, Artist, Track and so on. For some vocabularies, it might even know about possible values too. The GTAA (an extensive vocabulary for TV etc.), for example, contains approximately 97,000 Persons, 27,000 Names, 14,000 Locations and 18,000 (TV) Makers. The Meta Index knows all these names, and it knows whether a name represents a person or a location.

Querying for Vocabularies

For any given query, the Meta Index can tell which of the indexes that are part of the Open Index will give meaningful results. How is this done? In two ways. Suppose someone enters the query “artist=lennon”. If you send this query to the Meta Index, it will look up which vocabularies have a field named ‘artist’ (ignoring a few problems that arise during matching here), then it will look up the indexes registered for these vocabularies and send you this list of indexes. The next step is to send the same query to these indexes and integrate the results.
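The field-based lookup might look roughly like this. The two dictionaries, the example locations and the function name `indexes_for_query` are hypothetical, and the matching is deliberately naive (exact, lower-cased field names only).

```python
# Hypothetical data: the fields each vocabulary defines, and the indexes
# that registered themselves for each vocabulary.
VOCABULARY_FIELDS = {
    "musicbrainz": {"album", "artist", "track"},
    "dc": {"title", "creator", "subject"},
}
INDEXES_BY_VOCABULARY = {
    "musicbrainz": {"http://example.org/music"},
    "dc": {"http://example.org/library", "http://example.org/music"},
}


def indexes_for_query(query):
    """Return the index locations whose vocabularies define the queried field."""
    field, _, _value = query.partition("=")
    field = field.strip().lower()
    locations = set()
    for vocabulary, fields in VOCABULARY_FIELDS.items():
        if field in fields:
            locations |= INDEXES_BY_VOCABULARY.get(vocabulary, set())
    return sorted(locations)


print(indexes_for_query("artist=lennon"))  # ['http://example.org/music']
```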

Now suppose you enter a simpler query such as “yvon jaspers”. The Meta Index could look up the phrase “yvon jaspers” and find it in the list of names of television Makers in the GTAA. So it gives you the list of GTAA indexes, and you could take this as a suggestion to include these in your query.
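The value-based lookup can be sketched in the same spirit. The table of known values, the index location and the function name `suggest` are made up for illustration; a real Meta Index would hold the complete GTAA lists.

```python
# Hypothetical excerpt of known vocabulary values:
# each name maps to (vocabulary, category) pairs.
KNOWN_VALUES = {
    "yvon jaspers": [("gtaa", "Maker")],
}
INDEXES_BY_VOCABULARY = {"gtaa": ["http://example.org/tv-archive"]}


def suggest(query):
    """Return (category, vocabulary, indexes) suggestions for a free-text query."""
    key = " ".join(query.lower().split())  # normalise case and whitespace
    return [
        (category, vocabulary, INDEXES_BY_VOCABULARY.get(vocabulary, []))
        for vocabulary, category in KNOWN_VALUES.get(key, [])
    ]


print(suggest("yvon jaspers"))
```

The category in each suggestion is what a portal could use to phrase a hint to the user.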

Automating versus User controlled
The examples above assume that you are in control of querying the Meta Index and of deciding what to do with the hits it gives you. In practice, however, the interaction with the Meta Index will be invisible to users. A search portal might show the suggestions from the Meta Index and leave the user free to direct the query to one or more of the suggested indexes, for example by asking “did you mean to search for TV maker ‘yvon jaspers’?”. Another search portal might simply take the hints from the Meta Index directly, carry out the user’s search query on all of the suggested indexes and just show the results.

Freedom of Design Choices

It all comes down to decoupling design choices: creating flexibility because we cannot see into the future. With current technology (big integrated indexes), the choice of a particular search engine often implies many other choices that you only become aware of later. One such decision is the way the search engine deals with multiple indexes. The Open Index allows such decisions to be made separately. The way of working as outlined above is only our first take on how we will do it. It could be any other algorithm in the future.

Next

In the next installment of this blog I’ll cover the efficiency and scalability of the Open Index.

Posted in technology, vision | Tagged late integration, library, lucene, meresco, metadata, open index, openindex, owlim, search, solr

Open Index: Query Resolving

Posted on July 15, 2011 by Erik J. Groeneveld

This second installment about the Open Index deals with how search queries are processed. Each index contains RDF descriptions (metadata) for objects identified by URIs (identifiers). The objects themselves are not in the index; only the metadata and the URIs are.

Late Integration

Late integration means that integration happens late in the process: during query processing. This is quite different from integrating metadata databases. It is also quite different from the current practice of leaving the databases in place and integrating the metadata into one monolithic index.

Wide or deep

In the Open Index, an index that contains little metadata for many objects is called a wide index. An index that contains much metadata for a small set of objects is called a deep index. Current-style indexes are often both wide and deep, as they try to encompass everything. Indexes may contain overlapping sets of URIs; for example, one index might enrich another.

Leading Index

The index that receives a query from a client (e.g. portal) is called the leading index. It determines which other indexes to involve and it takes care of the final ranking and faceting. It returns to the user a top list of matching records, along with facet data.

Integration

The leading index executes the query itself and sends it to one or more other indexes. It asks the other indexes to return only the URIs (conceptually). The leading index integrates these sets using set arithmetic resulting in one set of hits. This arithmetic is the same as is found deep inside search engines.
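To make the set arithmetic concrete, here is a minimal sketch. It assumes each index has already returned its hits as a set of URIs; the function name `integrate` and the choice of a single boolean operator for all indexes are simplifications made for this example.

```python
def integrate(leading_hits, other_hits, operator="and"):
    """Combine the URI set of the leading index with those of other indexes."""
    result = set(leading_hits)
    for hits in other_hits:
        if operator == "and":
            result &= set(hits)  # intersection: URI must match everywhere
        elif operator == "or":
            result |= set(hits)  # union: URI may match anywhere
        else:
            raise ValueError(f"unsupported operator: {operator}")
    return result


leading = {"urn:a", "urn:b", "urn:c"}
others = [{"urn:b", "urn:c", "urn:d"}, {"urn:c", "urn:e"}]
print(sorted(integrate(leading, others, "and")))  # ['urn:c']
print(sorted(integrate(leading, others, "or")))
```

In practice different parts of a query would go to different indexes and be combined with different operators; the principle stays the same.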

Facets

Based on the set of hits, the leading index adds facets. These are correct and complete facets, not estimates. (How facets can be distributed is the subject of a later, more advanced post.)
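Because the complete set of hits is known, exact facet counts follow from simple counting. The sketch below assumes the leading index can look up the values of one facet field per URI; the function name and the data layout are hypothetical.

```python
from collections import Counter


def facet_counts(hits, field_values):
    """Count facet values over the integrated hit set.

    field_values maps a URI to the list of values it has for one facet field.
    """
    counts = Counter()
    for uri in hits:
        counts.update(field_values.get(uri, []))
    return counts


genre = {"urn:a": ["rock"], "urn:b": ["rock", "pop"], "urn:c": ["jazz"]}
print(facet_counts({"urn:a", "urn:b"}, genre))
```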

Relevancy

The Open Index is technology agnostic and derives its approach to integrating relevancy from the way indexes are intended to be used. Assumptions:

  1. The deeper the index, the more relevant its hits are. It is more specialized.
  2. The more indexes yield hits on a URI, the more relevant it is.
  3. Native relevancy of the leading index can be used when the width of this index determines the full scope of the query.

A later blog will go into more detail about relevancy.
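Purely as an illustration of assumptions 1 and 2, one could give each index a depth weight and sum the weights of the indexes that yield a URI. This formula, the weights and the names below are invented for the example; the post does not prescribe a scoring function.

```python
def rank(hits_per_index, depth):
    """Order URIs by the summed depth weights of the indexes that returned them.

    hits_per_index maps an index name to its set of URIs.
    depth maps an index name to a weight (deeper index = higher weight).
    """
    scores = {}
    for index_name, uris in hits_per_index.items():
        for uri in uris:
            scores[uri] = scores.get(uri, 0.0) + depth[index_name]
    # Highest score first; ties are broken by URI to keep the order stable.
    return sorted(scores, key=lambda uri: (-scores[uri], uri))


hits = {"wide": {"urn:a", "urn:b", "urn:c"}, "deep": {"urn:b"}}
print(rank(hits, {"wide": 1.0, "deep": 3.0}))  # ['urn:b', 'urn:a', 'urn:c']
```

Here urn:b ranks first both because the deep index returned it and because two indexes agree on it.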

End result

Once the top hits have been determined, the leading index gathers the RDF documents for these hits. It does so by asking the same indexes for their parts; it then merges the results into one RDF description and sends this to the client.
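Conceptually, merging RDF descriptions of the same URI is a union of their triples. The sketch below models triples as plain (subject, predicate, object) tuples rather than using an RDF library, and ignores blank nodes; the function name and sample data are assumptions.

```python
def merge_descriptions(uri, parts):
    """Union the triples about one URI that were gathered from several indexes."""
    merged = set()
    for triples in parts:
        merged |= {triple for triple in triples if triple[0] == uri}
    return merged


from_wide = {("urn:b", "dc:title", "Imagine")}
from_deep = {
    ("urn:b", "dc:title", "Imagine"),      # duplicate, kept only once
    ("urn:b", "mb:artist", "John Lennon"),
    ("urn:z", "dc:title", "Unrelated"),    # other subject, left out
}
for triple in sorted(merge_descriptions("urn:b", [from_wide, from_deep])):
    print(triple)
```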

What’s next?

In the following installment of this blog I will describe how indexes are found and how optimizations make the basic idea outlined so far scale up.

Posted in technology, vision | Tagged late integration, library, lucene, meresco, metadata, open index, owlim, search, solr

Open Index or Late Integration

Posted on July 8, 2011 by Erik J. Groeneveld

Recently, Bibliotheek.nl and the Dutch Royal Library started cooperating to develop the Open Index. Gerard Kuys and I developed the idea.

What does it do?
The Open Index makes maintaining search indexes easier while leaving more room for specialized metadata formats. It makes no choices regarding functionality or technology unless a choice is essential for the concept to work properly. It is open in the sense that:

  1. Any metadata format can play.
  2. Any technology can be used.

No more monolithic indexes
The main idea is to stop creating one big index for all metadata sets because of these disadvantages:

  1. All metadata formats are unified to a common denominator. The richer metadata is ignored.
  2. All queries are unified to what one specific type of index supports. Not all is full-text.
  3. Update processing becomes a bottleneck, especially when trying to avoid 1 and work around 2.

This causes organisational and technical costs to rise while failing to deliver on specialization.

Specialization
The Open Index does not integrate indexes but it integrates search results. Integrate; not federate! This allows maintainers of specialized sets to make their own choices regarding metadata, technology and update processing when creating their own, independent indexes. These indexes then join in a bigger Open Index by providing unified identifiers and a standard search protocol.

Unified Resource Identifiers
Each index must use URIs to identify what is in the index. This is simply good practice that is being applied ever more widely, but here it is essential.

Standard Search Protocol
Each index should implement a standard protocol for searching. The query language is a variable (see 2), but standardizing on one or two does not hurt. What is essential are the two types of results that must be supported:

  1. A top list of the complete records for the best ranked results.
  2. A complete list of only the URIs of all the results.
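The two result types could be exposed by a toy index like the one below. The class `InMemoryIndex`, its method names and its substring matching are assumptions made for this sketch; the post does not define a concrete API or wire format.

```python
class InMemoryIndex:
    """Toy index: a record matches when the query occurs in any metadata value."""

    def __init__(self, records):
        self._records = records  # maps URI -> metadata dict

    def _matching_uris(self, query):
        needle = query.lower()
        return sorted(
            uri
            for uri, metadata in self._records.items()
            if any(needle in str(value).lower() for value in metadata.values())
        )

    def top_records(self, query, limit=10):
        # Result type 1: complete records for the best hits.
        # There is no real ranking here; hits are simply in URI order.
        uris = self._matching_uris(query)[:limit]
        return [{"uri": uri, **self._records[uri]} for uri in uris]

    def all_uris(self, query):
        # Result type 2: only the URIs, but for all hits.
        return set(self._matching_uris(query))


index = InMemoryIndex({
    "urn:a": {"title": "Imagine"},
    "urn:b": {"title": "Imagine (live)"},
    "urn:c": {"title": "Help!"},
})
print(index.top_records("imagine", limit=1))
print(sorted(index.all_uris("imagine")))
```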

Peer to peer
Indexes are arranged in a peer-to-peer fashion. Any index may deal with user queries and will then be called the leading index for the duration of one particular query. User queries are handled by returning results of type 1. The leading index uses type 2 results from other indexes to fulfill the request. The algorithms are packaged in reusable components and deserve a separate blog post.

What’s next?
To explain the concept clearly, I will write some more blogs about:

  • Query resolving: about how the actual integration works.
  • Finding and selecting indexes: how indexes find other indexes to work with.
  • Efficiency optimizations: what is needed to make it work with large indexes and big query loads.

Or call or e-mail me if you don’t want to wait! I am happy to discuss this subject.

Posted in technology, vision | Tagged late integration, library, lucene, meresco, metadata, open index, openindex, owlim, search, solr
