MidCOM 2.4 Status Report: Indexer Progress, Minor Goodies and Release Plans
2005-03-15 21:17

The MidCOM Indexer is almost complete. As you can see on this site, it is already running and fairly stable (see also my last blog posting). The only thing still missing is the integration into about five components that do not support it yet. Even with this small reservation, MidCOM 2.4 is nearing completion.

MidCOM Indexer

In the last few days, a few goodies have been added to the Indexer. Two important features were still missing to finally achieve maximum usability: Attachment Indexing and Advanced Result Permission Checks.

Attachment Indexing

The new document type midcom_attachment is used to index attachments in the Midgard database. Since direct indexing of the binary files is not really sensible, the document type distinguishes attachments by their MIME type:

Simple text files are indexed as-is. Markup text (like XML or HTML) has its markup removed. RTF, PDF and MS Word DOC files are converted to plain text using unrtf, pdftotext or catdoc respectively.

All other files are treated as binary, indexing only their Metadata.
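
The distinction by MIME type described above could be sketched roughly as follows. This is only an illustration in present-day PHP syntax: attachment_index_strategy() is a hypothetical helper, not part of MidCOM, and the exact MIME type strings it matches are my assumptions. Only the tool names come from the text.

```php
<?php
// Sketch only: attachment_index_strategy() is a hypothetical helper, not MidCOM API.
// It maps a MIME type to the way text would be extracted before indexing.
function attachment_index_strategy(string $mimetype): string
{
    // Normalise: lower-case and drop parameters such as "; charset=utf-8".
    $mimetype = strtolower(trim(explode(';', $mimetype)[0]));

    // Formats that need an external converter. The tool names are from the
    // post; the MIME strings they are keyed on are assumptions.
    $converters = [
        'text/rtf'           => 'unrtf',
        'application/rtf'    => 'unrtf',
        'application/pdf'    => 'pdftotext',
        'application/msword' => 'catdoc',
    ];
    if (isset($converters[$mimetype])) {
        return 'convert:' . $converters[$mimetype];
    }

    // Markup text has its tags removed before indexing.
    $markup = ['text/html', 'text/xml', 'application/xml', 'application/xhtml+xml'];
    if (in_array($mimetype, $markup, true)) {
        return 'strip-markup';
    }

    // Any remaining text/* type is indexed as-is.
    if (strpos($mimetype, 'text/') === 0) {
        return 'plain';
    }

    // Everything else is treated as binary: metadata only.
    return 'metadata-only';
}
```

The converter table is checked first so that text/rtf is not caught by the generic text/* rule.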

The main problem here is that, especially with simple text files, I have next to no chance of identifying the character encoding used for the file. Since I need valid UTF-8 for the XML request, I had to resort to mb_detect_encoding to try to autodetect the content's encoding. This works well for files which are already UTF-8 encoded, but most single-byte encodings will be detected as ISO-8859-15. This might or might not be good, but it is difficult to find a more flexible solution in this respect. This does not apply to PDF and DOC files, as both pdftotext and catdoc are character set aware and can convert to UTF-8 directly.
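
To make the encoding guess concrete, here is a minimal sketch in present-day PHP syntax. It is not the indexer's actual code: to_utf8() and its fallback parameter are hypothetical, and the mbstring extension is assumed to be available. Valid UTF-8 is passed through unchanged, and everything else is treated as the single-byte fallback.

```php
<?php
// Sketch only: to_utf8() is a hypothetical helper, not the indexer's code.
// Requires the mbstring extension.
function to_utf8(string $bytes, string $fallback = 'ISO-8859-15'): string
{
    // Valid UTF-8 (which includes plain ASCII) is passed through untouched.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Strict detection against the single-byte candidate. Nearly any byte
    // sequence is valid in such an encoding, so this is a guess, not a proof.
    $detected = mb_detect_encoding($bytes, [$fallback], true);
    $source = ($detected !== false) ? $detected : $fallback;

    return mb_convert_encoding($bytes, 'UTF-8', $source);
}
```

Checking for valid UTF-8 first keeps the result predictable. The price is the one already mentioned: a file in some other single-byte encoding is silently read as ISO-8859-15.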

Another problem related to character sets is that some files may contain control characters other than CR/LF, which are also invalid in fully compliant XML. I have tried to filter them out using what PHP calls "UTF-8 aware Perl Regular Expressions", but I honestly do not trust them.
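
A filter of this kind might look like the sketch below. It is an assumption on my part rather than the expression used in the indexer: strip_invalid_xml_chars() is a hypothetical name, and the character class is the negation of the "Char" production of XML 1.0, which also lets the tab character through.

```php
<?php
// Sketch only: strip_invalid_xml_chars() is a hypothetical helper.
function strip_invalid_xml_chars(string $utf8): string
{
    // Negated class of the XML 1.0 "Char" production: tab, LF, CR and the
    // ranges U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF are kept.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $clean = preg_replace($pattern, '', $utf8);

    // With the /u modifier preg_replace() returns null when the subject is
    // not valid UTF-8; an empty string is safer than a broken XML request.
    return $clean ?? '';
}
```

Because the /u modifier rejects invalid UTF-8 outright, the input should already have gone through the encoding conversion before this filter is applied.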

Attachment Indexing Datamanager Integration

Of course, to be really useful, attachments should be indexed automatically. This has been done by extending the blob and collection types to support indexing themselves on changes. Unfortunately, the datatypes are not yet documented, so you'll have to stick with what I write here for now:

The default behavior for blobs is to index themselves, while images do not. Collections only relay the calls to their child elements. You can modify this behavior by setting the key datatype_blob_autoindex in the field configuration to true or false, according to what you want.
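
As a rough illustration of how the defaults and the override combine, consider the sketch below, written in present-day PHP syntax. Only the key name datatype_blob_autoindex is taken from the description above. The field layout around it and the helper field_autoindexes() are hypothetical and merely stand in for the real, still undocumented schema.

```php
<?php
// Sketch only. The field layout and the helper are hypothetical; only the
// key name datatype_blob_autoindex comes from the post.
$fields = [
    'manual' => ['datatype' => 'blob'],                                      // default: indexed
    'photo'  => ['datatype' => 'image'],                                     // default: not indexed
    'draft'  => ['datatype' => 'blob',  'datatype_blob_autoindex' => false], // opt out
    'scan'   => ['datatype' => 'image', 'datatype_blob_autoindex' => true],  // opt in
];

function field_autoindexes(array $field): bool
{
    // An explicit setting always wins over the datatype default.
    if (array_key_exists('datatype_blob_autoindex', $field)) {
        return (bool) $field['datatype_blob_autoindex'];
    }

    // Defaults as described: blobs index themselves, images do not.
    return ($field['datatype'] ?? '') === 'blob';
}
```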

The code in place automatically catches all changes to a given datamanager-driven content object. This includes reindexing: all blobs of datamanager documents are reindexed automatically, so you don't have to think about this problem at this point.

Error Resilience

MidCOM will fail silently if indexing fails for whatever reason. This was done so that continued usage is not hampered just because a single document failed to index. Right now, the most probable error condition is an incorrect character in the XML stream, produced by the attachment indexer as outlined above.

Interestingly, Kaffe and the original Sun JRE seem to treat this error condition differently. Kaffe feels stricter than Sun, rejecting XML files Sun would normally accept. I have not yet investigated what exactly goes wrong in these instances (as finding suitable attachments is not that trivial). So my recommendation is to use the Sun JRE if you want to minimize unindexed documents due to such errors, and to use Kaffe if it is not so important that a few of the stranger attachments are not recognized correctly. On Nathan, perhaps 1% of the attachments failed, all of them either text/plain or text/richtext.

Advanced Result Permission Checks

Before today, MidCOM query results were only filtered by a few default rules, namely topic visibility and metadata visibility. Obviously, this is not enough. Various components (or, in the future, external sources) have their own advanced rules about visibility.

net.nemein.incidentdb, for example, shows only the results you own, unless you are a member of the management group. To map this behavior, a component can specify three "modes" of security beyond the default mechanisms described above:

First, and most useful, there is the component security mode. It calls the event handler method _on_check_document_permissions of the component interface class. The handler returns true if the object may be displayed, false otherwise. While this callback may modify the passed document during its runtime, this should be avoided unless absolutely necessary. Most of the time you have alternatives to that.
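
To illustrate the idea using the incident database rule mentioned above, here is a minimal sketch in present-day PHP syntax. Only the method name _on_check_document_permissions is taken from the text. The class, the method signature, the shape of the document and the helper filter_results() are all hypothetical and do not reflect the real interface.

```php
<?php
// Sketch only: the class, the document fields and the method signature are
// hypothetical; only the method name comes from the post.
class example_incidentdb_interface
{
    private string $current_user;
    private array $managers;

    public function __construct(string $current_user, array $managers)
    {
        $this->current_user = $current_user;
        $this->managers = $managers;
    }

    // Returns true if the search result may be shown to the current user.
    // The document is only read here, never modified.
    public function _on_check_document_permissions(array $document): bool
    {
        // Members of the management group see everything.
        if (in_array($this->current_user, $this->managers, true)) {
            return true;
        }

        // Everybody else only sees the results they own.
        return ($document['owner'] ?? null) === $this->current_user;
    }
}

// Hypothetical stand-in for the indexer: keep only the permitted results.
function filter_results(array $documents, example_incidentdb_interface $component): array
{
    return array_values(
        array_filter($documents, [$component, '_on_check_document_permissions'])
    );
}
```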

Minor Goodies

A few nice additions have been made throughout the system, easing development work.

NAP got the new key MIDCOM_NAV_OBJECT. It stores an instance of the object identified by MIDCOM_NAV_GUID. It is there for performance reasons, as many other classes working with NAP require the object itself. Since the introduction of the metadata facilities, each object in question has to be loaded anyway. Components may return the object for leaves if they have already retrieved it; if not, NAP will load it during preprocessing of the leaves.
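
The "load only if the component did not supply it" behavior can be pictured with the sketch below, in present-day PHP syntax. It is not NAP's implementation. The two constants are provided by MidCOM itself, so the placeholder values defined here, the function preprocess_leaves() and the loader callback are all hypothetical.

```php
<?php
// Sketch only. In MidCOM these constants come from the framework; the
// placeholder values and the function below are hypothetical.
defined('MIDCOM_NAV_GUID') || define('MIDCOM_NAV_GUID', 'nav_guid');
defined('MIDCOM_NAV_OBJECT') || define('MIDCOM_NAV_OBJECT', 'nav_object');

// Preprocessing step: load the object only where the component did not
// already hand it over with the leaf.
function preprocess_leaves(array $leaves, callable $load_by_guid): array
{
    foreach ($leaves as $id => $leaf) {
        if (empty($leaf[MIDCOM_NAV_OBJECT])) {
            $leaves[$id][MIDCOM_NAV_OBJECT] = $load_by_guid($leaf[MIDCOM_NAV_GUID]);
        }
    }

    return $leaves;
}
```

A component that has already fetched its objects therefore saves one load per leaf.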

The debugger now supports a simple debug_push_class helper, used to produce $classname::$functionname debug prefixes. For maximum laziness, you can now use debug_push_class(__CLASS__, __FUNCTION__).
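
The following sketch shows what such a call produces, in present-day PHP syntax. The debug_push_class() defined here is only a stand-in so that the example runs on its own. MidCOM ships its own implementation, and the prefix stack class and the example class are hypothetical.

```php
<?php
// Stand-in for illustration only; MidCOM provides the real debug_push_class().
// The prefix stack below is hypothetical.
final class sketch_debug_stack
{
    public static array $prefixes = [];
}

function debug_push_class(string $classname, string $functionname): void
{
    sketch_debug_stack::$prefixes[] = "{$classname}::{$functionname}";
}

class example_viewer
{
    public function show(): void
    {
        // __CLASS__ and __FUNCTION__ are magic constants filled in by PHP,
        // so the call can be pasted unchanged into any method.
        debug_push_class(__CLASS__, __FUNCTION__);
    }
}

(new example_viewer())->show();
```

After the call, the last prefix on the stack is example_viewer::show.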

Release Plans

Finally, it looks like we are nearing the completion of all features required for MidCOM 2.4. The indexer seems to work fairly well, component integration looks fine, and the new Navigation System has been proving itself for some time now in all of my testbeds.

To prove my point, Nathan Syntronics, the site you're reading at this very moment, has been running CVS HEAD for about a week now, deploying the indexer with great success. Upgrading was a breeze (had I wanted to, I could have had approval for my entire website now for "free") and performance looks good, especially on the indexer side.

I will release a first beta tomorrow, before I leave for vacation, to give you time to take a look at the new release. If it works as well for you as it does for me, we will enter feature freeze at the beginning of April, with a final release out perhaps two weeks later.

This release will also go into the next stable Midgard Release 1.7 as a new default MidCOM installation.

Recommended Reading