The MidCOM Indexer
The MidCOM Indexer is the most significant new feature in MidCOM 2.4. It allows on-the-fly indexing of any MidCOM-driven website, and uses an external indexer daemon built on Lucene to store and retrieve this information.
- mRFC 9: MidCOM Indexing Service
- mRFC 14: MidCOM Remote Indexing Service: XML Communication Protocol
- Class midcom_service_indexer in the MidCOM API Documentation, section midcom.services.
The Indexer is not directly integrated into PHP. On the one hand, a persistently running indexer daemon performs better (on average) than a fully integrated solution. On the other hand, there is no really usable PHP-level on-demand indexer out there anyway (not that I would trust PHP far enough in this respect).
The structure of the index is described in more detail in mRFC 9, which also explains how Documents and Fields interact.
This document will focus on setting up and using the Indexer.
Setting up the Lucene Daemon
All required files, built from the current CVS state, are available for download on this page. If you want an up-to-date build of the system, follow these instructions:
Building and Installing the Daemon
- Go to the Lucene Website and download the latest Lucene binary tarball. In it you will find a file named lucene-$version.jar. Rename it to a plain lucene.jar.
- Go to the external-tools/indexer-backends/lucene directory of the current MidCOM CVS. Copy the lucene.jar file into this directory and run make there; this builds a file named indexer.jar.
- Create a directory and copy the files lucene.jar, indexer.jar, xml-communication-request.dtd and xml-communication-response.dtd into it.
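The build and install steps above can be sketched as a pair of shell functions. This is only an illustration under assumptions: the function names build_indexer and install_indexer are made up for this example, every path is a placeholder you pass in, and it assumes the two DTD files sit in the same directory in which make produced indexer.jar.

```shell
#!/bin/sh
# Sketch only: function names and all paths are placeholders.

# Build indexer.jar inside the MidCOM CVS checkout.
#   $1  downloaded lucene-$version.jar
#   $2  external-tools/indexer-backends/lucene directory of the checkout
build_indexer() {
    # Copy the Lucene jar into the build directory as a plain lucene.jar.
    cp "$1" "$2/lucene.jar" || return 1
    # Run make in a subshell so the caller's working directory is kept.
    (cd "$2" && make)
}

# Copy the four runtime files into the daemon directory.
#   $1  directory holding the built files (assumed to include the DTDs)
#   $2  daemon directory, created if it does not exist
install_indexer() {
    mkdir -p "$2" || return 1
    for f in lucene.jar indexer.jar \
             xml-communication-request.dtd \
             xml-communication-response.dtd; do
        cp "$1/$f" "$2/" || return 1
    done
}
```

After sourcing the functions you would call build_indexer with the path of the downloaded jar and the checkout directory, then install_indexer with the checkout directory and the directory the daemon should run from. Both functions return a non-zero status as soon as one step fails, so a missing file is noticed before the daemon is started.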
Running the Daemon
Go to the newly created directory using a user account that has write permission for it. Run java -jar indexer.jar.
You should be fine from that point: the daemon listens on 127.0.0.1:2222, which is also the default setting on the MidCOM side.
The daemon runs in the foreground by default, unless you launch it through some nohup wrapper. There is no init script yet.
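Since there is no init script, a small nohup wrapper is one way to keep the daemon running after you log out. The following is a minimal sketch, not part of the MidCOM distribution: the function names, the INDEXER_CMD override variable, and the indexer.log and indexer.pid file names are all assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: names and file locations are hypothetical.

# Start the daemon detached from the terminal and record its PID.
#   $1  daemon directory (must be writable by the current user)
# INDEXER_CMD may override the launch command; it is split on spaces.
start_indexer() (
    # The function body is a subshell, so this cd does not leak out.
    cd "$1" || exit 1
    # stderr carries the daemon's log output, so capture it as well.
    nohup ${INDEXER_CMD:-java -jar indexer.jar} \
        >indexer.log 2>&1 </dev/null &
    echo $! >indexer.pid
)

# Stop the daemon recorded in indexer.pid.
#   $1  daemon directory
stop_indexer() {
    [ -f "$1/indexer.pid" ] || return 1
    kill "$(cat "$1/indexer.pid")" && rm -f "$1/indexer.pid"
}
```

Calling start_indexer with the daemon directory launches java -jar indexer.jar in the background; stop_indexer with the same directory sends it the default termination signal. The PID file is only as reliable as the process behind it, so check indexer.log if the daemon does not answer on its port.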
The daemon accepts the name of a configuration file as its first command line argument. A full configuration file looks like this:
loglevel = WARNING
bind = 127.0.0.1
port = 2222
The values shown here are the defaults: log warning-level messages to stderr (no log file) and bind to 127.0.0.1:2222. Check java.util.logging.Level for the valid logging levels.
You should be fine using the defaults though.
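If you do need to deviate from the defaults, the configuration file can be generated by a few lines of shell. This is a sketch under assumptions: the function name write_indexer_config, the INDEXER_LOGLEVEL, INDEXER_BIND and INDEXER_PORT variables, and the file name indexer.conf used below are invented for this example. Only the three keys and their default values come from the configuration shown above.

```shell
#!/bin/sh
# Sketch only: function and variable names are hypothetical.

# Write a daemon configuration file.
#   $1  target file name
# Any value not overridden through the environment falls back to the
# documented default.
write_indexer_config() {
    cat >"$1" <<EOF
loglevel = ${INDEXER_LOGLEVEL:-WARNING}
bind = ${INDEXER_BIND:-127.0.0.1}
port = ${INDEXER_PORT:-2222}
EOF
}
```

A typical use would be to call write_indexer_config indexer.conf in the daemon directory and then start the daemon with java -jar indexer.jar indexer.conf, passing the file name as the first argument. If you change the bind address or port here, remember that the MidCOM side has to be pointed at the same host and port.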
Configuring MidCOM to use the Indexer
This is relatively easy and consists of three tasks:
Activate the indexing feature in the MidCOM configuration
As usual, you will find all detailed information in the MidCOM API docs, section midcom_config.php. As long as you stick to the default configuration, it is enough to activate the XMLTCP indexing backend during MidCOM startup:
$GLOBALS['midcom_local_config']['indexer_backend'] = 'xmltcp';
Two more configuration options, indexer_xmltcp_host and indexer_xmltcp_port, allow you to explicitly specify the host/port combination where the indexer runs.
Index your entire site
Unless you are building a site from scratch, you obviously have to reindex your entire website. This is done by accessing this URL with full admin privileges:
Two important notes about this. First, it will take quite some time, as the MidCOM side of the interface does not yet support batch indexing. Second, and far more important, reindexing takes a huge amount of RAM: reindexing this website, which has around 250 documents on it, requires about 60 MB of RAM in total. Right now I believe that I have not done anything wrong and that PHP simply does not free memory it no longer uses. Of course, I welcome any help in looking over the reindex code in this respect.
Create the Indexer Frontend
Create a new topic using the component midcom.helper.search.
Downloads
- Response Document DTD (431 bytes, 2005-03-10)
- Request Document DTD (1,137 bytes, 2005-03-10)
- The Indexing Daemon (16,548 bytes, 2005-03-15)