Plucene again, and a few things about MidCOM Permalinks
2005-02-03 11:51

I am by now really annoyed with Plucene, especially since I discovered that the unclosed filehandles truly originate in Plucene itself. So instead I worked on something more promising (you need a diversion here and there, or you'll go mad, you know): I finally added PermaLink support to MidCOM, along with a few other goodies.

Plucene again

While optimizing the code I have written, I was finally able to track down Plucene's open-filehandle problem a bit further. Not that it was fun; after all, I have enough trouble finding my own bugs...

It seems to me that Plucene duplicates file handles while querying the index. The corresponding objects most probably hold cyclic references, which prevent them from being destroyed by the Perl GC. I have no idea how intelligent that one is with regard to cyclic references; they are one of the nastier things you have to worry about when writing a garbage collector.

Here is an excerpt of what I wrote to the Plucene Bugtracker about this problem:


What I found out is interesting: after around 100 iterations my
application had these file handles open:

1 _8.f1
1 _8.f3
1 _8.f4
1 _8.f5
1 _8.f6
1 _8.f7
1 _8.fdt
1 _8.fdx
237 _8.frq
1 _8.prx
1 _8.tis


As a matter of fact, these dangling objects are causing a memory leak:

Script start:          9.0 MB VM size
After 3000 iterations: 34.5 MB VM size

Not nice, as it rules out using Plucene in a daemon-like manner, unless the GC can somehow be convinced to kill these objects. I don't think it can, though.

What really annoys me is that I have not received any response from the developers on either the bug tracker or the mailing list, both of which I wrote to. Not for this bug, and not for the other one I have reported.

More and more, this brings me to the point where I think it might not be wise to recommend Plucene for larger deployments.

MidCOM 2.1.2 released

This was a rather quick release after the previous one last Friday. I did it mainly because of the PermaLink feature I built into it. I wanted this in the 2.2.0 final release, so that component authors can make use of it as soon as possible. 2.4.0 is a bit too far out for this, for my taste at least.

The change consists of three important parts, which both component and site style authors will like:

First of all, NAP now automatically constructs a fully qualified URL for all on-site leaves and nodes. You no longer need to track anchor prefixes or the like. Instead, use $node[MIDCOM_NAV_FULLURL] or $leaf[MIDCOM_NAV_FULLURL] respectively to obtain a complete URL, as you can see in this screenshot. I don't know why I didn't do something like that earlier... With this comes a new get_host_prefix helper function in midcom_application, which provides the fully qualified URL to the current root page of the website, taking both the protocol (http or https) and non-standard port numbers into account. The code has been taken from the relocate handler.
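To illustrate what such a host prefix has to take care of, here is a minimal sketch. It is not the actual get_host_prefix code: the function name example_host_prefix and its parameters are made up for this example, and it assumes the site lives directly at the root of the host.

```php
<?php
// Hypothetical helper for illustration only; NOT the real
// midcom_application::get_host_prefix() implementation.
function example_host_prefix(string $host, int $port, bool $https): string
{
    $protocol = $https ? 'https' : 'http';
    $default_port = $https ? 443 : 80;

    // Only non-standard ports have to show up in the URL.
    $port_suffix = ($port === $default_port) ? '' : ":{$port}";

    return "{$protocol}://{$host}{$port_suffix}/";
}

echo example_host_prefix('www.example.com', 80, false), "\n";  // http://www.example.com/
echo example_host_prefix('www.example.com', 8443, true), "\n"; // https://www.example.com:8443/
```

In a real request the three values would come from the web server environment rather than being passed in by hand.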

Note to self: The component's anchor prefix, which is currently host-local, should be using this helper too, so that the URLs are always complete.

Parallel to this, NAP has learned to resolve a GUID to a NAP object. To keep performance up, a three-stage approach is taken:

  1. If the GUID refers to a topic, load it, verify that it is in the tree, then return it. This will capture all nodes.
  2. If the GUID refers to an article, load it and its topic, verify that the topic is in the tree, load its NAP data and scan for the GUID in the NAP leaves. [1]
  3. If the GUID is something else, we do a full scan of the content tree until we have located the leaf. [2]
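The three stages can be sketched roughly as follows. This is only an illustration with in-memory arrays standing in for the database and the NAP tree; all names (example_resolve_guid, example_find_leaf, the array layouts and the GUID strings) are hypothetical and do not reflect the real NAP API.

```php
<?php
// Hypothetical stand-ins: GUID => object record, topic id => node, topic id => leaves.
$objects = [
    'guid-topic-1'   => ['type' => 'topic', 'id' => 1],
    'guid-article-1' => ['type' => 'article', 'topic' => 1],
    'guid-event-1'   => ['type' => 'event'],
];
$nodes = [
    1 => ['guid' => 'guid-topic-1', 'name' => 'news'],
    2 => ['guid' => 'guid-topic-2', 'name' => 'calendar'],
];
$leaves = [
    1 => [['guid' => 'guid-article-1', 'name' => 'index']],
    2 => [['guid' => 'guid-event-1', 'name' => 'party']],
];

// Scan one list of leaves for a GUID; returns the leaf or false.
function example_find_leaf(string $guid, array $leaf_list)
{
    foreach ($leaf_list as $leaf) {
        if ($leaf['guid'] === $guid) {
            return $leaf;
        }
    }
    return false;
}

function example_resolve_guid(string $guid, array $objects, array $nodes, array $leaves)
{
    $object = $objects[$guid] ?? null;
    if ($object === null) {
        return false;
    }
    // Stage 1: topics map directly to nodes, if they are in the tree.
    if ($object['type'] === 'topic') {
        return $nodes[$object['id']] ?? false;
    }
    // Stage 2: articles only need a scan of the leaves of their own topic.
    if ($object['type'] === 'article') {
        if (!isset($nodes[$object['topic']])) {
            return false;
        }
        return example_find_leaf($guid, $leaves[$object['topic']] ?? []);
    }
    // Stage 3: anything else requires a full scan over all leaves.
    foreach ($leaves as $leaf_list) {
        $leaf = example_find_leaf($guid, $leaf_list);
        if ($leaf !== false) {
            return $leaf;
        }
    }
    return false;
}

echo example_resolve_guid('guid-event-1', $objects, $nodes, $leaves)['name'], "\n"; // party
```

The point of the ordering is that the cheap cases (topics and articles) never trigger the expensive full scan.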

Finally, a new URL method in the MidCOM core glues this together. It takes URLs of the form midcom-permalink-$guid, resolves the GUID and relocates to the page designated by the MIDCOM_NAV_FULLURL key in the NAP structure.
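A rough sketch of what such a URL method has to do is shown below. The function names, the 'fullurl' array key (standing in for MIDCOM_NAV_FULLURL) and the resolver callback are hypothetical; the sketch also assumes that GUIDs are 32 lowercase hex characters, as in the examples further down, and it only computes the relocation target instead of actually relocating.

```php
<?php
// Extract the GUID from a "midcom-permalink-$guid" URL part, or return false.
// Assumption: GUIDs are 32 lowercase hex characters.
function example_parse_permalink(string $url_part)
{
    if (preg_match('/^midcom-permalink-([0-9a-f]{32})$/', $url_part, $matches) === 1) {
        return $matches[1];
    }
    return false;
}

// Resolve the permalink to the full URL of the NAP object, or false.
// $resolver is a hypothetical callback that maps a GUID to a NAP object array.
function example_permalink_target(string $url_part, callable $resolver)
{
    $guid = example_parse_permalink($url_part);
    if ($guid === false) {
        return false;
    }
    $nap_object = $resolver($guid);
    if ($nap_object === false) {
        return false;
    }
    return $nap_object['fullurl'];
}

// Usage with a dummy resolver that knows exactly one GUID.
$resolver = function (string $guid) {
    $known = ['e5b8fe905e750957022c1f8619850495' => ['fullurl' => 'https://www.example.com/taviewer/']];
    return $known[$guid] ?? false;
};
echo example_permalink_target('midcom-permalink-e5b8fe905e750957022c1f8619850495', $resolver), "\n";
```

The returned URL is what the request would then be relocated to.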

To make things easier for you, all NAP objects now also contain a MIDCOM_NAV_PERMALINK key, which holds the fully qualified PermaLink URL to the leaf or node in question. Just in case a leaf does not have a GUID associated with it (which shouldn't be the case as of the 2.1.2 release), it defaults to the FULLURL of the object.
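The fallback rule is simple enough to show in a few lines. Again this is only a sketch: example_permalink and the 'guid' and 'fullurl' array keys are hypothetical stand-ins for the real NAP constants, and the host prefix is assumed to end with a slash.

```php
<?php
// Build the permalink for a NAP object; fall back to its full URL
// if no GUID is associated with it.
function example_permalink(array $nap_object, string $host_prefix): string
{
    if (empty($nap_object['guid'])) {
        return $nap_object['fullurl'];
    }
    return "{$host_prefix}midcom-permalink-{$nap_object['guid']}";
}

$with_guid = ['guid' => '4a9abab7fdf8c33e01bf1072b58d3dab', 'fullurl' => 'https://www.example.com/taviewer/'];
$without_guid = ['guid' => null, 'fullurl' => 'https://www.example.com/taviewer/'];

echo example_permalink($with_guid, 'https://www.example.com/'), "\n";
echo example_permalink($without_guid, 'https://www.example.com/'), "\n";
```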

[1]: I don't use the article GUID directly, as I cannot reliably resolve it to a NAP object. Besides, it would circumvent the handling of articles that are not visible on-site (that is, leaves that do not have a MIDCOM_NAV_SITE value).

[2]: Comparing the nodes is not necessary at this stage, as nodes are always caught by the topic check.

Example code to show this information

Use something like this to display this information on your site for testing purposes:

Note that I have replaced the < and > angle brackets with [ and ], as HTMLarea is just too dumb for this :-(. Also, the lines have been wrapped for display purposes.

  // Show the NAP data of the current node.
  $nap = new midcom_helper_nav();
  $node = $nap->get_node($nap->get_current_node());
  echo "Component for this node: {$node[MIDCOM_NAV_COMPONENT]}[br]";
  echo "FullURL for this node: [a href='{$node[MIDCOM_NAV_FULLURL]}']{$node[MIDCOM_NAV_FULLURL]}[/a][br]";
  echo "GUID for this node: {$node[MIDCOM_NAV_GUID]}[br]";
  echo "Permalink for this node: [a href='{$node[MIDCOM_NAV_PERMALINK]}']{$node[MIDCOM_NAV_PERMALINK]}[/a][br]";

  // The leaf data is only available if a leaf is currently active.
  $leaf = $nap->get_current_leaf();
  if ($leaf !== false)
  {
      $leaf = $nap->get_leaf($leaf);
      echo "FullURL for this leaf: [a href='{$leaf[MIDCOM_NAV_FULLURL]}']{$leaf[MIDCOM_NAV_FULLURL]}[/a][br]";
      echo "GUID for this leaf: {$leaf[MIDCOM_NAV_GUID]}[br]";
      echo "Permalink for this leaf: [a href='{$leaf[MIDCOM_NAV_PERMALINK]}']{$leaf[MIDCOM_NAV_PERMALINK]}[/a][br]";
  }

One known inconsistency

When working with components that have explicit index articles, you run into a duplication problem: at this time, the index articles are part of the NAP data of at least the taviewer (I'm not a friend of this, actually, but the majority wanted it this way).

When requesting the index page of a taviewer, you will get these results for the node and leaf:

Atlantis > taviewer

Component for this node: de.linkm.taviewer

FullURL for this node: https://[...]/taviewer/
GUID for this node: e5b8fe905e750957022c1f8619850495
Permalink for this node https://[...]/midcom-permalink-e5b8fe905e750957022c1f8619850495

FullURL for this leaf: https://[...]/taviewer/
GUID for this leaf: 4a9abab7fdf8c33e01bf1072b58d3dab
Permalink for this leaf: https://[...]/midcom-permalink-4a9abab7fdf8c33e01bf1072b58d3dab

As you can see, you get two different Permalinks for the same object. This is a bit confusing at first, though it is logical once you take the internal structure into account. I have no real preference about what to do here.