First steps in the MidCOM (P)Lucene integration
2005-01-26 15:30

Today I started working on mRFC 9, the Midgard/MidCOM search engine interface proposal. The main thing I did today was to get the Plucene libraries running. After some initial difficulties, mostly related to missing documentation, everything went fine.

For a start, the documentation is average at best. Many parts are outdated or documented only with a one-liner that you might understand if you already know Plucene. Then again, if I already knew it, I wouldn't have to resort to the docs, would I? Nevertheless, I finally found out what the HitCollector callback from the example actually has to call, and got some search results after all. A dependency list would have been nice as well, so that I wouldn't have to go through trial and error until I had found all the packages needed for that particular piece of source code.

Leaving the documentation aside, Plucene itself seems to be mature enough to be used. Of course, it will have to prove this under a real stress test. The five-document index I have used won't take us very far in getting an impression of its real stability. Execution speed seems reasonable, in the vicinity of what I would have expected from a Perl command-line client.

The client code I have written is not very complex and has only been tested with plain text types. There is a date writer interface, but I am not sure whether it can be used to do range queries over date fields. In general, Plucene does not support range queries, and this could actually be a problem in some environments. Again, the documentation is relatively terse on the topic; we'll find out soon, I guess.

In the meantime you can look at the screenshot to see a bit of the code and results I have produced today. I think it looks promising after all.

The good news is that if Plucene turns out not to be up to the task, we still have the option of falling back to Lucene. The interface between MidCOM and the indexer will use XML as its query/response language, so it should be quite portable on the MidCOM side.

For those of you interested in playing around with Plucene, you have to install at least these Perl packages:

In addition, you should replace the query code with something like this; the original example won't work:

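use strict;
use warnings;

use Plucene::QueryParser;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Search::IndexSearcher;
use Plucene::Search::HitCollector;

# The parser needs an analyzer and a default field for unqualified terms.
# The field name "text" is an assumption; use whatever your index contains.
my $parser = Plucene::QueryParser->new({
    analyzer => Plucene::Analysis::SimpleAnalyzer->new(),
    default  => 'text',
});

my $query    = $parser->parse('some be author:henri');
my $searcher = Plucene::Search::IndexSearcher->new('my_index');
my $reader   = $searcher->reader;

# The collect callback is invoked once per hit with the document number
# and its score; we fetch the stored document through the index reader.
my @docs;
my $hc = Plucene::Search::HitCollector->new(
    collect => sub {
        my ($self, $doc, $score) = @_;
        print "Collecting $doc with score $score\n";
        push @docs, $reader->document($doc);
    }
);

$searcher->search_hc($query, $hc);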
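
The query code above expects an index called my_index to exist already. As a rough sketch of how a small test index could be built, following the pattern in the Plucene synopsis: the sample documents and the field names "text" and "author" are made up for illustration and are not the data I actually used.

```perl
use strict;
use warnings;

use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Index::Writer;

# Made-up sample data, purely for illustration.
my @samples = (
    { author => 'henri', text => 'To be or not to be, for some.' },
    { author => 'anna',  text => 'Some other document.' },
);

my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();

# A true third argument asks the writer to create a new index.
my $writer = Plucene::Index::Writer->new('my_index', $analyzer, 1);

for my $sample (@samples) {
    my $doc = Plucene::Document->new;
    $doc->add(Plucene::Document::Field->Text(text   => $sample->{text}));
    $doc->add(Plucene::Document::Field->Text(author => $sample->{author}));
    $writer->add_document($doc);
}

undef $writer;    # releasing the writer closes the index
```

With an index like this in place, a query such as 'some be author:henri' should have something to match against, which is enough to see the HitCollector callback fire.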

The next step on this front will be defining the XML query/response interface, as this has to be settled before actually implementing the backend interface.