PLEX - the PLucene EXperience
2005-01-27 20:00

After another day of documentation-reading, source-understanding, trial-and-error-ing and specification-writing, I finally think that I have a grasp of the Plucene engine. I still think that it is usable; you'll just have to handle it with care.

There were a few new noteworthy things about all this:

Summing up, it's not that bad, but it's not that good either. Well, I guess I'll just have to live with it. I'm not so sure that a pure Lucene solution would be better. While Lucene is proven, getting Java to run smoothly usually holds its own surprises.

Anyway.

I have finished the specification of the XML communication protocol for accessing remote indexers like Plucene, at least for now. While writing it up, a few new things came to mind that have to be taken care of when writing the PHP layer.

The distinction between (P)Lucene's storage types is rather crucial for an effective index. A short summary:

As I already wrote in the XML spec, neither date nor keyword can be reliably queried. A date cannot be recognized in a result set (it is just some gibberish that a mundane guy won't identify as a date), and a keyword cannot even be queried safely. Nevertheless we need them: date is the only way to restrict searches to a given date range, and keyword is the only way to reliably store a unique identifier that is used for deletion.
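To illustrate why an indexed date looks like gibberish and still works for range restrictions, here is a minimal sketch. It assumes an encoding similar in spirit to what Lucene-style date fields use: the timestamp in milliseconds, written in base 36 and zero-padded to a fixed width. The function names and the width of 9 are my own choices for illustration, and Plucene's actual serializer may differ in detail. It is written in Python for brevity.

```python
import string
from datetime import datetime, timezone

# Base-36 alphabet: digits sort before lowercase letters in ASCII,
# so fixed-width strings compare in the same order as the numbers.
DIGITS = string.digits + string.ascii_lowercase
WIDTH = 9  # assumed fixed width for this sketch


def encode_date(dt):
    """Encode a timezone-aware datetime as a fixed-width base-36 string."""
    millis = int(dt.timestamp() * 1000)
    if millis < 0:
        raise ValueError("dates before 1970 are not handled in this sketch")
    out = ""
    while millis:
        millis, rem = divmod(millis, 36)
        out = DIGITS[rem] + out
    if len(out) > WIDTH:
        raise ValueError("date too large for the fixed width")
    return out.rjust(WIDTH, "0")


def decode_date(encoded):
    """Turn the encoded string back into a UTC datetime."""
    return datetime.fromtimestamp(int(encoded, 36) / 1000, tz=timezone.utc)


def in_range(encoded, start, end):
    """A date range restriction is just a string comparison on the terms."""
    return encode_date(start) <= encoded <= encode_date(end)
```

The encoded value means nothing to a human reader, which is why the front end would have to decode it before showing it in a result list, but plain string comparison is enough to restrict a search to a date range.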

This leaves us with three field types, all of which are useful in theory (so all will be implemented), with the focus on unstored and text.

Why am I telling you this? Good question, simple answer: The current MidCOM-Indexer API draft works on the basis of simple key/value pairs for a document. This is no longer sufficient. Instead, we will need Document and Field classes to distinguish between the five field types available in total. The Document class will be represented by a class hierarchy to support the various indexing targets more easily.
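A rough sketch of what such Document and Field classes could look like follows. It is written in Python for brevity, while the real layer would be PHP. All class, method and field names are hypothetical and not taken from the MidCOM draft. The fifth type name, "unindexed", is my assumption based on Lucene's usual set of field constructors.

```python
from enum import Enum


class FieldType(Enum):
    DATE = "date"
    KEYWORD = "keyword"
    UNINDEXED = "unindexed"  # assumption: the fifth type, as in Lucene
    UNSTORED = "unstored"
    TEXT = "text"


class Field:
    """One named value plus the way the indexer should treat it."""

    def __init__(self, name, content, kind):
        self.name = name
        self.content = content
        self.kind = kind


class Document:
    """Base class: every document carries a unique id as a keyword field."""

    def __init__(self, identifier):
        self.fields = []
        self.add("id", identifier, FieldType.KEYWORD)

    def add(self, name, content, kind):
        self.fields.append(Field(name, content, kind))

    def get(self, name):
        for f in self.fields:
            if f.name == name:
                return f
        return None


class ArticleDocument(Document):
    """Hypothetical subclass for one indexing target (an article page)."""

    def __init__(self, identifier, title, abstract, content):
        super().__init__(identifier)
        self.add("title", title, FieldType.TEXT)
        self.add("abstract", abstract, FieldType.TEXT)
        # The full body is searchable but not copied into the index.
        self.add("content", content, FieldType.UNSTORED)
```

The point of the hierarchy is that each subclass decides once which field type fits which piece of data, so code that indexes a page does not have to think about storage types at all.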

Only if we have this distinction during indexing will we be able to create an efficient index. For example, it does not make much sense to store a complete copy of a page's content in the index, unless you want to do some cached-file stuff like Google does. Unfortunately, I actually don't want to know how much this will impact performance. So the recommendation will be to store as much data as possible as unstored, and to use text fields only for abstracts, metadata and stuff like that.
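To make the recommendation a bit more concrete, here is a small sketch that estimates how many bytes of content end up copied into the index, depending on the field type chosen. The stored/not-stored flags follow my understanding of Lucene's field semantics and are an assumption as far as Plucene is concerned. The helper name and the tuple layout are made up for this example, and Python is used only for brevity.

```python
# Which field types keep a copy of the content in the index.
# Assumption: mirrors Lucene's usual semantics; "unindexed" is assumed
# to be the fifth type.
STORED = {
    "text": True,       # indexed and stored
    "keyword": True,    # indexed as-is and stored
    "date": True,       # treated like a keyword here
    "unindexed": True,  # stored only, not searchable
    "unstored": False,  # searchable, but no copy is kept
}


def stored_bytes(fields):
    """Sum the UTF-8 size of all content that would be stored.

    `fields` is a list of (name, kind, content) tuples.
    """
    return sum(
        len(content.encode("utf-8"))
        for _name, kind, content in fields
        if STORED[kind]
    )


body = "lorem ipsum " * 1000   # stand-in for a full page body
abstract = "A short summary."

everything_as_text = [
    ("abstract", "text", abstract),
    ("content", "text", body),
]
recommended = [
    ("abstract", "text", abstract),
    ("content", "unstored", body),
]
```

Comparing `stored_bytes(everything_as_text)` with `stored_bytes(recommended)` shows the difference: in the second layout only the abstract is duplicated into the index, while the body stays searchable.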

So, where do we go from here?

Tomorrow I'll try to come up with an initial Plucene front end that can talk XML as I have specified it. I just hope that my Perl is good enough for that task. In the end, I'll finally learn more Perl, so it might not be that bad...