Plucene up and running, sort of...
2005-02-01 17:14

Since my last posting, several things have happened on the Plucene front. Some of them are good (the XML interface works in general), some of them bad (Plucene bugs and workarounds). At this time, the Plucene Request Processor I built is not suitable for daemon usage, but with the simple exec interface that is planned as a first solution, there should be no major problem.

First of all, I (still) absolutely hate badly documented software. Writing the Plucene interface ate up quite some time, as I had to read the source to get even a vague idea of what they are doing in there. It does work now, but it has a few caveats.

If you want to check it out yourself, download this example tarball. It requires Plucene, LibXML and XML::Writer to be installed and working. The test.pl script takes requests on STDIN and answers on STDOUT, while the bench.pl script runs a stress test. Increase its loop count tenfold to reproduce the out-of-filehandles error.

Unclosed File Handles somewhere in Plucene

This one cost me the most time. In my test scenario, the file handles to the Plucene index files are not closed when they are no longer needed. It took me several hours to bring my own code into a state where I can be reasonably sure that I am not the one responsible for this. I don't have certainty though, so any ideas are welcome.

My guess is that there are a few circular references which the Perl garbage collector does not clean up in time to avoid exhausting the available file handles.
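To make the suspicion concrete, here is a minimal sketch of how a reference cycle can keep a resource alive under reference counting. It is written in Python rather than Perl, and it is only an analogy: it assumes CPython, whose cycle collector is switched off here to mimic pure reference counting. The `Handle` and `Reader` classes are hypothetical stand-ins, not Plucene classes, and nothing here shows where such a cycle might sit in Plucene.

```python
import gc
import weakref


class Handle:
    """Hypothetical stand-in for an object that owns an open index file."""

    def __init__(self):
        self.owner = None


class Reader:
    """Hypothetical stand-in for an index reader that holds a Handle."""

    def __init__(self, link):
        self.handle = Handle()
        if link == "strong":
            # Back-reference closes the cycle: Reader -> Handle -> Reader.
            self.handle.owner = self
        elif link == "weak":
            # A weak back-reference does not keep the Reader alive.
            self.handle.owner = weakref.ref(self)


def handle_survives(link):
    """Report whether the Handle outlives the last outside reference to its Reader."""
    gc.disable()  # assumption: mimic reference counting without a cycle collector
    try:
        reader = Reader(link)
        probe = weakref.ref(reader.handle)  # observes the Handle without owning it
        del reader  # drop the only outside reference
        return probe() is not None
    finally:
        gc.enable()


if __name__ == "__main__":
    for link in ("none", "strong", "weak"):
        print(link, handle_survives(link))
```

Under these assumptions only the "strong" variant should leave the handle alive. Perl offers weak references as well (Scalar::Util's weaken), which is the usual way to break such a cycle.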

This seems to be a known bug in Plucene, so I added a couple of notes to the report. I'm a bit sceptical about getting a response though: the bug has been open since last March...

Performance

This is a somewhat happier topic. It is not too bad, or, in other words, it could have been worse. I ran a few test scenarios on my test machine, a Celeron 1000 with 256 MB of RAM.

Processing a single request with two very simple queries against a very small index takes 500 ms on average. Most of this time must be either startup overhead or XML processing, as my index consisted of exactly two documents ;-).

Running this query 500 times, with XML parsing each time but reusing the same Indexer objects, takes 22 seconds on average, which works out to 44 ms per query. Given the small index, I tend to say that this is a good estimate of the minimum overhead per request once we work with a daemon instead of a shell script. That is considerably faster than the 600 ms for a single query above.
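The per-query figure is simple division. As a quick check, here it is as a small Python sketch; the function name is only illustrative.

```python
def per_request_ms(total_seconds, requests):
    """Average time per request in milliseconds."""
    return total_seconds * 1000 / requests


if __name__ == "__main__":
    # 22 seconds for 500 requests, as measured above.
    print(per_request_ms(22, 500))
```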

You're right, this is not really a production number, but it is an indication of what we are dealing with here.

The really interesting point was a different one. While running the 500 requests, I could not measure any difference in runtime between LibXML processing with and without validation. That is a good thing, as it saves a lot of checking code: you can just rely on the DTD.

Some Details

In case you want the detailed numbers: I did five test runs with 500 iterations each, using the time utility to measure runtime. The numbers given are based on the user time, not counting the system time. The XML request was read from disk every time (presumably cached, so it's more like reading from memory), and the output was sent to /dev/null. Thus there was only minimal overhead in getting requests into and responses out of the system.
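For anyone who wants to repeat this kind of measurement without the time utility, the following is a rough Python sketch of the same idea: run a command several times, discard its output, and record only the user CPU time of the child process. It assumes a POSIX system, where `os.times()` reports child times. The command is whatever you pass in; this is not the setup I actually used.

```python
import os
import subprocess
import sys


def user_seconds_per_run(cmd, runs=5):
    """Run cmd `runs` times and return the child user CPU time of each run."""
    results = []
    for _ in range(runs):
        before = os.times().children_user
        with open(os.devnull, "wb") as sink:
            # Output goes to the null device, as in the measurement above.
            subprocess.run(cmd, stdin=subprocess.DEVNULL, stdout=sink, check=True)
        after = os.times().children_user
        results.append(after - before)
    return results


if __name__ == "__main__":
    # Placeholder command: replace with the benchmark script you want to time.
    timings = user_seconds_per_run([sys.executable, "-c", "pass"])
    print(sum(timings) / len(timings))
```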

Thinking about this, you could get the feeling that either DTD input validation is trivial or that LibXML always performs this validation and just ignores the errors when validation is "turned off". I have not searched the 'net about this, but I lean towards the second option. Either way, it seems fast enough to me to be viable in production use.

Where are we now?

Now if I knew that for sure...

As a matter of fact, I do now have a complete Plucene interface suitable for use on the command line. It can be invoked by PHP, and writing the appropriate XML interface code in PHP should not be that hard. The only thing not yet in CVS is an actual command line script that can handle it in a safe manner. Should be a breeze.
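The exec-style call itself is simple in any language. As a rough illustration (in Python here, not the PHP code that is still to be written), the sketch below sends a request on STDIN and reads the answer from STDOUT. The command and the request contents are placeholders. Passing the command as an argument list avoids going through a shell, which is one way to keep the call reasonably safe.

```python
import subprocess
import sys


def run_request(cmd, request_xml, timeout=10.0):
    """Send request_xml to cmd on STDIN and return what it writes to STDOUT."""
    done = subprocess.run(
        cmd,  # argument list, so no shell is involved
        input=request_xml.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    if done.returncode != 0:
        raise RuntimeError(done.stderr.decode("utf-8", "replace"))
    return done.stdout.decode("utf-8")


if __name__ == "__main__":
    # Placeholder command that just echoes its input back.
    echo = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
    print(run_request(echo, "<request/>"))
```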

What is really bothering me is the question of Plucene support. I'll find out, I guess. The bugs on CPAN, which are nearly a year old, do not really encourage me.

Whether or not I can recommend Plucene with a clear conscience will depend on its behaviour once I have had time to test it a bit with a bigger, real-life index. Perhaps Nathan Syntronics will be my first test subject when the code is complete.