MidCOM 2.4 Status report: Indexer, new Base-Classes and a dose of Plucene
2005-02-25 10:17

[Screenshot: Yesterday's favourite]

Yesterday brought quite a few new things for MidCOM, some good and, unfortunately, some bad. Let's start with the good ones, as they are less frustrating. I completed the demo indexer implementation yesterday: there is now a single component that fully supports the indexer (de.linkm.taviewer) and a reindex driver within MidCOM. While I was working on this, I developed a new way for components to interface with MidCOM. The new interface base class finally does almost everything for you, reducing a basic component interface from originally 190 lines to just 15. Finally, when testing the taviewer index interface, I (again...) stumbled on a quite critical bug in Plucene, which fails to optimize an index cleanly (see the screenshot, too).

Indexer

The indexer made great progress yesterday. My initial test runs using the taviewer have brought up most of the basic issues by now, I think. Taviewer now correctly updates the index on all content changes, and it also has a reindex interface, driven by the midcom-exec interface. The code required for this is concise; as long as you are happy with Datamanager's autoindexing, you can boil it down to something like this:

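// Fetch the indexer service from MidCOM (assigned by reference).
$indexer =& $GLOBALS['midcom']->get_service('indexer');
// Hand the current Datamanager instance to the indexer, relying on its autoindexing.
$indexer->index($this->_datamanager);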

You should check the MidCOM API documentation of the Datamanager Document class for details about this.

The reindex driver basically does the same; most of the code goes into iterating over the articles and instantiating the Datamanager for each of them. Performance looks good, though a complete reindex run could easily take a few hours. Plucene's locking seems to be good enough not to hinder normal site operation, including the on-demand indexing of the components. I have not done extensive testing though, just a few quick calls to the shell backend to check for a global-style lock.
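To make the shape of such a driver more concrete, here is a minimal, self-contained sketch. It is not the actual MidCOM reindex driver: example_datamanager, example_indexer and example_reindex are hypothetical stand-ins invented for this illustration, and the article arrays are made-up sample data. Only the index($datamanager) call shape is taken from the snippet above.

```php
<?php
// Hypothetical stand-in for the Datamanager; it merely wraps one article.
class example_datamanager
{
    public $article;

    function __construct($article)
    {
        $this->article = $article;
    }
}

// Hypothetical stand-in for the indexer service; it records what it was given.
class example_indexer
{
    public $indexed = array();

    // Mirrors the index($datamanager) call shape used above.
    function index($datamanager)
    {
        $this->indexed[] = $datamanager->article['id'];
        return true;
    }
}

// Assumed driver loop: wrap each article in a datamanager and index it.
function example_reindex($articles, $indexer)
{
    $count = 0;
    foreach ($articles as $article) {
        $datamanager = new example_datamanager($article);
        if ($indexer->index($datamanager)) {
            $count++;
        }
    }
    return $count;
}

// Made-up sample data standing in for the articles of a topic.
$articles = array(
    array('id' => 1, 'title' => 'First'),
    array('id' => 2, 'title' => 'Second'),
);
$indexer = new example_indexer();
$reindexed = example_reindex($articles, $indexer);
echo $reindexed, " articles reindexed\n";
```

The loop itself is trivial; in a real run the time would presumably be spent inside the per-article indexing call, which is why a full reindex can take hours.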

Next-Generation Component Interface

This is something I should have done a long time ago but, as usual with these things, you just don't think about them.

Nevertheless, the current component interface (you know, the four classes in interfaces.php which you usually copy & paste when creating a new component) was not really that efficient. OK, to be honest, it was crap. What we have now is a new component API, which consolidates all of this into a single class that also serves as a base class for you to use.

Implicitly, this means that the two component interfaces are not compatible with each other. Fortunately, it was easy for me to implement a compatibility layer in all relevant classes (mainly the component loader), so that both the old and the new interface are supported for now. This will stay throughout the 2.4 release cycle, so that components not hosted in the MidCOM CVS can be adapted. Removal of these deprecated functions is planned for MidCOM 2.6 (whenever that is).

Converting components to the new API will get its own article, as there are a few side effects you should be aware of as a component author. That document will also cover the implications for the MidCOM core, as all applications working with components directly need to be rewritten.

So, what do you have now? If you want a basic component without any special features (for example, indexing support), you can boil down a component interface to this:

class de_linkm_taviewer_interface
    extends midcom_baseclasses_components_interface
{
    function de_linkm_taviewer_interface()
    {
        parent::midcom_baseclasses_components_interface();

        $this->_component = 'de.linkm.taviewer';
        $this->_autoload_files = Array
        (
            'viewer.php',
            'admin.php',
            'navigation.php'
        );
        $this->_autoload_libraries = Array('midcom.helper.datamanager');
    }
}

You might wonder how you can influence the component interface now. The interface defines events, which are called in certain situations. Right now, I'm taking a minimalistic approach here. The base API provides a single callback, _on_initialized, which is called after the base class has initialized the component as far as it is able to. You can add your own initialization code at this point; you must not do it during construction, as the component is not yet set up there. The reindex driver mentioned above also uses the event interface, calling your component interface method _on_reindex.
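The callback idea can be illustrated with a small, self-contained sketch. Note that example_base_interface is a hypothetical stand-in, not the real midcom_baseclasses_components_interface: the initialize() and reindex() entry points, the $log member and the boolean return values are assumptions made for this illustration. Only the hook names _on_initialized and _on_reindex come from the description above. The sketch uses __construct so that it also runs on current PHP versions.

```php
<?php
// Hypothetical stand-in for the real base class; it only mimics the hook pattern.
class example_base_interface
{
    public $_component = '';
    public $log = array();

    // Assumed entry point: do the generic setup first, then fire the hook.
    function initialize()
    {
        $this->log[] = 'base setup for ' . $this->_component;
        return $this->_on_initialized();
    }

    // Assumed entry point that a reindex driver would call.
    function reindex()
    {
        return $this->_on_reindex();
    }

    // Default hooks do nothing; a component overrides only what it needs.
    function _on_initialized()
    {
        return true;
    }

    function _on_reindex()
    {
        return true;
    }
}

class example_taviewer_interface extends example_base_interface
{
    function __construct()
    {
        // Only declare static facts here; the component is not set up yet.
        $this->_component = 'de.linkm.taviewer';
    }

    function _on_initialized()
    {
        // Custom initialization runs after the base class has done its part.
        $this->log[] = 'custom initialization';
        return true;
    }

    function _on_reindex()
    {
        $this->log[] = 'reindexing articles';
        return true;
    }
}

$interface = new example_taviewer_interface();
$interface->initialize();
$interface->reindex();
print_r($interface->log);
```

The point of the pattern is that the base class controls the order of events, so component code in the hooks can rely on the generic setup having already happened.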

A new invention, which cleans up the global namespace, is the component data storage area, housed in the global variable midcom_component_data. It provides storage arrays for all components (indexed by their name). The component base class makes it available through the member _data. You can do the same in your components using a construct like this (for taviewer):

$this->_data =& $GLOBALS['midcom_component_data']['de_linkm_taviewer'];

Most importantly, the component base class uses this storage area both for the default configuration ('config') and the NAP active leaf ID ('active_leaf'). This turned out to be quite a good solution, as it is the cleanest way to automate such things. Think of it as a shared memory area.
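To show why binding by reference matters for this "shared memory" behaviour, here is a minimal, self-contained sketch. The classes example_viewer and example_navigation, their methods and the initial array contents are hypothetical and exist only for this illustration; the global variable name and the 'config' and 'active_leaf' keys are the ones mentioned above.

```php
<?php
// Assumption: the framework has created one storage array per component.
$GLOBALS['midcom_component_data'] = array(
    'de_linkm_taviewer' => array('config' => null, 'active_leaf' => false),
);

// Hypothetical viewer class that writes into the shared area.
class example_viewer
{
    public $_data;

    function __construct()
    {
        // Bind by reference so that writes land in the shared area, not in a copy.
        $this->_data =& $GLOBALS['midcom_component_data']['de_linkm_taviewer'];
    }

    function set_active_leaf($id)
    {
        $this->_data['active_leaf'] = $id;
    }
}

// Hypothetical navigation class that reads from the same area.
class example_navigation
{
    public $_data;

    function __construct()
    {
        $this->_data =& $GLOBALS['midcom_component_data']['de_linkm_taviewer'];
    }

    function get_active_leaf()
    {
        return $this->_data['active_leaf'];
    }
}

$viewer = new example_viewer();
$navigation = new example_navigation();
$viewer->set_active_leaf(42);
$leaf = $navigation->get_active_leaf();
var_dump($leaf);
```

With a plain assignment instead of =&, each object would get its own copy of the array and the navigation class would never see the value set by the viewer.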

Another dose of the Plucene drug

Plucene again cost me a few hours, until I realized that the optimize call wreaks havoc. Take a look at the screenshot to see what happens. The corrupt data is coming in over stderr. Actual indexing seems to work, though the abstract field is corrupt, containing the same strange strings as those echoed to stderr. Right now I have disabled the explicit optimization of the index after the indexing calls, which seems to solve the problem. I cannot tell, obviously, what impact this might have on performance.

I again filed a bug report about this in the Plucene bug tracker at CPAN. Not that I have much hope, as I didn't get even a single word of response to my two previous bugs or to the mail I sent to the Plucene development list. Actually, the only mail I have received over that list is an invitation from the Apache Project for Plucene to join the ASF. Not that this message produced a single response on that list either.

As for my personal opinion: I definitely will no longer recommend Plucene for production usage. This project is not mature, nor are its code and documentation anywhere near what I would call 'high quality'. Besides, and this makes things far worse than they already would be, support from the author(s) is close to non-existent, with important bugs having been open for nearly a year now. Sorry folks, but this is not the way software development should be done; it is a good example of how it should not be done. The bugs themselves would be secondary if someone were actually there to help me find and fix them. After all, this is a major building block of the Open Source idea.

Or, to put it another way, Plucene has already cost me about 15 to 20 hours in tracing bugs back into the Plucene system, and I don't want to think about how many more might arise when we start deploying Plucene for a several-hundred-page site with several changes a day.