The MidCOM Indexer

The MidCOM Indexer is the most significant new feature in MidCOM 2.4. It allows on-the-fly indexing of any MidCOM-driven website, using an external indexer daemon built on Lucene to store and retrieve this information.

General Structure

The Indexer is not directly integrated into PHP. On one hand, a persistently running indexer daemon performs better (on average) than a fully integrated solution. On the other hand, there is no really usable PHP-level on-demand indexer out there anyway (not that I would trust PHP far enough in this respect).

The structure of the index is described in more detail in mRFC 9, which also explains how Documents and Fields interact.

This document will focus on setting up and using the Indexer.

Setting up the Lucene Daemon

All required files, as of the current CVS state, are available for download in the Downloads section of this page. If you want an up-to-date build of the system, follow these instructions:

Building and Installing the Daemon

  1. Go to the Lucene Website and download the latest Lucene binary tarball. In it you will find a file named lucene-$version.jar. Rename it to a plain lucene.jar.
  2. Go to the external-tools/indexer-backends/lucene directory of the current MidCOM CVS. Copy the lucene.jar file into this directory and run make there; this builds a file named indexer.jar.
  3. Create a directory and copy the files lucene.jar, indexer.jar, xml-communication-request.dtd and xml-communication-response.dtd into it.
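
Step 3 can be scripted. The following is a minimal sketch, not part of the MidCOM distribution: the function name install_indexer and both directory arguments are hypothetical, and it assumes that the two DTD files sit in the same directory as the jars, which the steps above do not state.

```shell
# Sketch of step 3: gather the four runtime files in one directory.
# $1 = directory containing the built files (assumed to hold the DTDs too)
# $2 = directory the daemon will later be run from (created if missing)
install_indexer() {
    build_dir=$1
    install_dir=$2
    mkdir -p "$install_dir" || return 1
    for f in lucene.jar indexer.jar \
             xml-communication-request.dtd xml-communication-response.dtd; do
        # Stop early if one of the required files is missing.
        if [ ! -f "$build_dir/$f" ]; then
            echo "missing: $build_dir/$f" >&2
            return 1
        fi
        cp "$build_dir/$f" "$install_dir/" || return 1
    done
}
```

A call would then look like install_indexer external-tools/indexer-backends/lucene /opt/midcom-indexer, where the target path is only an example.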

Running the Daemon

Go to the newly created directory using a user account that has write permissions to it, then run java -jar indexer.jar.

You should be fine from that point: the daemon listens on 127.0.0.1:2222, which is also the default setting on the MidCOM side.

The daemon runs in the foreground unless you launch it through a wrapper such as nohup. There is no init script yet.
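
Since there is no init script, a small wrapper can detach the daemon from the terminal. This is a sketch only: the function name start_indexer and the file names indexer.out and indexer.pid are made up for illustration, and it assumes that java is on the PATH.

```shell
# Start the daemon in the background from its install directory.
# $1 = directory containing indexer.jar and the other runtime files
start_indexer() {
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$1" || exit 1
        # nohup keeps the daemon alive after the terminal closes;
        # stdout and stderr are appended to indexer.out (hypothetical name).
        nohup java -jar indexer.jar >> indexer.out 2>&1 &
        # Remember the process ID so the daemon can be stopped later.
        echo $! > indexer.pid
    )
}
```

With this in place, the daemon can later be stopped by passing the number stored in indexer.pid to kill.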

Changing Configuration

The daemon accepts the name of a configuration file as its first command-line argument. A full configuration file looks like this:

logfile = 
loglevel = WARNING
bind = 127.0.0.1
port = 2222

The values shown here are the defaults: warning-level messages are logged to stderr (no log file), and the daemon binds to 127.0.0.1:2222. See java.util.logging.Level for the valid logging levels.
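
As an illustration, a non-default configuration could be written and used roughly as follows. The filename indexer.conf, the log file name and the port 2223 are arbitrary choices for this sketch; that a relative logfile path is resolved against the daemon's working directory is an assumption, not something documented here.

```shell
# Write a configuration file using the four keys shown above.
# INFO is one of the level names defined by java.util.logging.Level.
cat > indexer.conf <<'EOF'
logfile = indexer.log
loglevel = INFO
bind = 127.0.0.1
port = 2223
EOF

# Then start the daemon with the file as its first argument:
#   java -jar indexer.jar indexer.conf
```

If the port is changed like this, the port configured on the MidCOM side has to be changed to match.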

You should be fine using the defaults though.

Configuring MidCOM to use the Indexer

This is relatively easy and consists of three tasks:

Activate the indexing feature in the MidCOM configuration

As usual, you will find all detailed information in the MidCOM API docs, section midcom_config.php. As long as you stick to the default configuration, it is enough to activate the XMLTCP indexing backend during MidCOM startup:

$GLOBALS['midcom_local_config']['indexer_backend'] = 'xmltcp';

Two more configuration options, indexer_xmltcp_host and indexer_xmltcp_port, allow you to explicitly specify the host/port combination where the indexer runs.

Index your entire site

Unless you are building a site from scratch, you obviously have to reindex your entire website. This is done by accessing this URL with full admin privileges:

http://your.site.com/midcom-exec-midcom/reindex.php

Two important notes about this. First, it will take quite some time, as the MidCOM side of the interface does not yet support batch indexing. Second, and far more important, reindexing takes a huge amount of RAM: reindexing this website, which has around 250 documents, requires about 60 MB in total. Right now I believe that I have not done anything wrong and that PHP simply does not free memory it no longer uses. Of course, I welcome any help in reviewing the reindex code in this respect.

Create the Indexer Frontend

Create a new topic using the component midcom.helper.search.
 

Downloads

File                     Size           Last modified
Response Document DTD    431 bytes      2005-03-10
Request Document DTD     1,137 bytes    2005-03-10
The Indexing Daemon      16,548 bytes   2005-03-15