Why do we need a DB abstraction for MidCOM?

Up to and including the MidCOM 2.4 series, all components accessed the database directly, without any interface within MidCOM in between. When I started integrating MgdSchema into what will eventually become MidCOM 2.6, I decided it was finally time to change that. This article outlines why, and what exactly I want to change.

So what annoys me about what MidCOM currently does? Put simply, I, as a framework developer, can't hook any checks into the database queries without implementing them in every component. This matters mainly for a) access control and b) more advanced forms of cache invalidation.

But this is not all.

Right now there are several places within the MidCOM source tree where component authors (including myself) encapsulate various Midgard objects in their own component-level classes. None of these wrappers is standardized in a way that provides framework-driven features.

This becomes especially important with the transition to MgdSchema, where both Rambo and I have started to write our own wrapper classes around the still-incomplete MgdSchema objects out of sheer necessity. While Rambo went for a more localized approach, I am currently trying to get a general solution up and running for MidCOM.

My aim is to hide all these nifty little details from the component author, providing an extended interface, implemented at the PHP level, over what MgdSchema offers at this time. In the future, part of these features can then be superseded by a core implementation for performance reasons (hence the full encapsulation). I also want an easy way of adding new MidCOM features to all components at once by patching the core classes in a single place, instead of changing every class out there in the wild.

As usual nowadays, one major consideration will be performance; I have neglected this particular point for too long now. (Well, actually I would never have guessed that somebody would use MidCOM with hundreds or thousands of leaves within a single topic...) One consequence is that there won't be any fully automatic ACL checks throughout the entire framework. Access control will still be done very selectively, not generally.

The main reason for this is that merging the ACL is quite time-consuming, and the tree inheritance structure we have prevents the check from being done at a plain SQL level. So both core and component authors should keep a very careful eye on the ACL checks they do, to keep them as cheap as possible.

There is also a need for an ACL cache (this has to go on my todo list), so that the information does not have to be merged again on every access to a content object.
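To make the caching idea a bit more concrete, here is a minimal sketch of a per-request ACL cache. Everything in it is an assumption for illustration: the class name SketchAclCache, the flat privilege arrays, and the rule that an object's own settings override inherited ones are not taken from MidCOM. It only shows how walking the parent chain once and remembering the merged result avoids repeated merges.

```php
<?php
// Hypothetical content tree: object id => array(parent id or null, own privileges).
$tree = array(
    1 => array(null, array('read' => true, 'write' => false)),
    2 => array(1, array('write' => true)),
    3 => array(2, array()),
);

// Hypothetical cache; not part of MidCOM.
class SketchAclCache
{
    private $tree;
    private $cache = array();
    public $merges = 0; // counts how often a merge was actually computed

    public function __construct(array $tree)
    {
        $this->tree = $tree;
    }

    public function privileges($id)
    {
        if (isset($this->cache[$id])) {
            return $this->cache[$id]; // already merged during this request
        }
        $this->merges++;
        list($parent, $own) = $this->tree[$id];
        $inherited = ($parent === null) ? array() : $this->privileges($parent);
        // Assumed rule: settings on the object override inherited ones.
        $this->cache[$id] = array_merge($inherited, $own);
        return $this->cache[$id];
    }
}

$acl = new SketchAclCache($tree);
$merged = $acl->privileges(3);
```

A second call for object 3, or for any of its ancestors, is then answered from the cache without another merge.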

All this will need some careful abstraction, so that all requirements can still be met. And, more importantly, it has to be good enough that I can integrate the requirements that will arise in the future.

What do I want to introduce exactly?

What I have started to do is write a set of base classes derived from the current set of Midgard Schema classes. Since PHP does not support multiple inheritance, all shared functionality has been put into a single class, and the developer needs to subclass the MgdSchema class and add all the corresponding "external" member functions. This sounds more complicated than it is: all classes look just about the same, differing only in the class/constructor name and three or four lines of callbacks.

If you look at the source of the MidgardArticle wrapper class:

The differences between the classes lie solely in the name of the class and its base class (lines 27 and 37) and in the wrapper for the still-missing delete operation in line 224, which will become obsolete in the very near future.

One point that MgdSchema currently leaves uncovered is an intelligent parent loader. MidCOM's idea of a content hierarchy (which I am not prepared to sacrifice) requires that each object has exactly one parent. It may have multiple links to other objects, but it must have that single parent. If we give up this constraint, the ACL inheritance will ultimately fail.

Thus, to facilitate this we have a get_parent method, starting in line 396. It has to return the object which is the immediate parent of the current object, or null if there is no such object.

Everything else is relayed to the database object base class, which is considered what C++ calls a "friend" of the DB object and may therefore safely access private members of that class. This essentially gives us the same semantics as if we had multiple inheritance.
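The following is a minimal sketch of that delegation pattern, not the actual MidCOM wrapper source. All names are hypothetical: StubMgdSchemaArticle and StubTopicRef stand in for classes Midgard would provide, SketchDbObjectBase plays the role of the shared "friend" class, and update_raw is an invented helper name.

```php
<?php
// Stand-in for a MgdSchema class; the real one is provided by Midgard.
class StubMgdSchemaArticle
{
    public $id = 0;
    public $topic = 0; // ID of the owning topic, 0 if none

    public function update()
    {
        return true; // pretend the row was written
    }
}

// Stand-in for a loaded parent object.
class StubTopicRef
{
    public $id;

    public function __construct($id)
    {
        $this->id = $id;
    }
}

// Hypothetical shared base: logic common to all wrappers lives here once.
class SketchDbObjectBase
{
    public static $log = array();

    public static function update($object)
    {
        // Framework hooks (ACL checks, cache invalidation) would go here.
        self::$log[] = 'pre-update ' . get_class($object) . ' #' . $object->id;
        return $object->update_raw();
    }
}

// Per-class wrapper: a few lines of relaying plus the get_parent callback.
class SketchArticle extends StubMgdSchemaArticle
{
    public function update_raw()
    {
        return parent::update();
    }

    public function update()
    {
        return SketchDbObjectBase::update($this);
    }

    // Returns the immediate parent object, or null if there is none.
    public function get_parent()
    {
        return $this->topic ? new StubTopicRef($this->topic) : null;
    }
}
```

Every wrapper repeats the same few relaying methods, which is exactly the duplication the rest of this article tries to get rid of.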

How can we effectively maintain this?

Now this is the million dollar question. More or less at least.

Maintaining these wrapper classes by hand is certainly not an option, especially as the basic interface (which is identical for all classes) can change over time.

So the next idea I had was to write a simple code generator, so that the average developer can automate the task of building these wrapper classes. Then again, I thought, this is not really a good solution, as we still wouldn't have the kind of automation you would get with real wrapper classes, especially when I want to introduce new features.

Ok, then we need a Plan C (no, not from outer space).

At this point in my thoughts, I remembered the way J2EE does this. They, too, face similar problems when they have to implement things like container-managed persistence. Their solution was to create classes on demand when an application is deployed for the first time.

What I find intriguing here is that this keeps performance at its best (as it produces regular code) without limiting the flexibility of development.

So let's continue with this idea of an integrated DB abstraction code generator.

The general idea

So what do we need in the end? The developer wants to have a class he can use as usual, without much hassle, and which is derived from a MgdSchema class like NewMidgardArticle.

That class should ideally consist only of those methods that need to be overridden at the PHP level, like get_parent. Everything else should be provided by some parent class instead.

In addition, because we do not have a full inheritance hierarchy, we will need some way of automatically determining information related to the class. Especially interesting here are the name of the original MgdSchema base class, the name of the MidCOM base class that should be used, and the name of the table (currently called "realm").

This meta information is especially important as MidCOM needs to be able to convert to and from almost all kinds of Midgard objects that we may encounter.

With this information, MidCOM will implicitly generate an intermediate class, from which you in turn derive your application-level class. These classes will be explicitly bound to a component or to the MidCOM core, with strict namespacing.

A first example

Let us look at articles as they are now. Articles are historically stored in the table article, with the old Midgard class being named MidgardArticle and the MgdSchema class being named NewMidgardArticle. The information MidCOM will have for this class then looks roughly like this:

'table' => 'article'
'old_class_name' => 'MidgardArticle'
'new_class_name' => 'NewMidgardArticle'
'midcom_class_name' => 'MidCOMArticle'

With this information, we can already start off, defining an intermediate class:

class __MidCOMArticle extends NewMidgardArticle
{
    // Auto-generated interface code with stubs for all callbacks
}

Note the double underscore prefix, which indicates that this is an intermediate class not intended for direct use.
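How such an intermediate class could be produced from the meta information can be sketched as follows. This is an illustration under assumptions, not the real class builder: the function name sketch_build_intermediate_class, the $__table member, and the single get_parent stub are invented, and NewMidgardArticle is replaced by an empty stand-in so the snippet runs without Midgard.

```php
<?php
// Stand-in for the MgdSchema class; the real one comes from Midgard.
class NewMidgardArticle
{
    public $id = null;
}

// Hypothetical builder: returns the PHP source of the intermediate class.
function sketch_build_intermediate_class(array $def)
{
    $name = '__' . $def['midcom_class_name'];
    $lines = array(
        "class {$name} extends {$def['new_class_name']}",
        '{',
        "    public \$__table = '{$def['table']}';",
        '',
        '    // Callback stub; the application-level class overrides it.',
        '    public function get_parent()',
        '    {',
        '        return null;',
        '    }',
        '}',
    );
    return implode("\n", $lines) . "\n";
}

$source = sketch_build_intermediate_class(array(
    'table' => 'article',
    'old_class_name' => 'MidgardArticle',
    'new_class_name' => 'NewMidgardArticle',
    'midcom_class_name' => 'MidCOMArticle',
));

// In a real setup the source would be written to a cache file and included;
// eval() keeps this sketch self-contained.
eval($source);
```

Because the result is ordinary PHP source, it can be cached on disk and costs nothing extra at runtime beyond loading one more file.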

The application developer, in this case the MidCOM core team, will in turn inherit from this class, creating the class that is actually used.

class MidCOMArticle extends __MidCOMArticle
{
    function MidCOMArticle($id = null)
    {
        // Keep the constructor chain intact.
        parent::__MidCOMArticle($id);
    }

    // Override what you need from this point on.
}

From this point, you can either use your class directly (in case of component specific classes) or again inherit from the class (MyArticle extends MidCOMArticle).

Of course you can also define classes which are only found in MgdSchema. In that case you simply set the old_class_name property to null, indicating just this.

Organizational issues

A main point that arises here is the update semantics. Obviously, these auto-generated classes need to be cached as live-generating them is inefficient.

For me, the most natural way would be to have definition files looking roughly like our first example above, each of which maps to a class file generated by MidCOM. If the definition file is newer than the class file (a quick test), the classes need to be regenerated.

A similar test can be done with the last modification time of the actual class builder, so that classes are automatically regenerated when the class builder changes.
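The two timestamp tests can be sketched in a few lines. The function name and its parameters are assumptions made for this example; only filemtime, file_exists and clearstatcache are standard PHP.

```php
<?php
// Hypothetical helper: decides whether a cached class file must be rebuilt.
function sketch_needs_rebuild($definition_file, $builder_file, $cache_file)
{
    // PHP caches stat results within a request; start from fresh values.
    clearstatcache();

    if (!file_exists($cache_file)) {
        return true; // nothing generated yet
    }

    $cache_time = filemtime($cache_file);

    // Rebuild if either the definition or the builder itself is newer.
    return filemtime($definition_file) > $cache_time
        || filemtime($builder_file) > $cache_time;
}
```

The check costs a handful of stat calls per definition file, which is why bundling several classes into one definition file keeps it cheap.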

The classes in question will be located in the MidCOM cache directory, and it should also be possible to have multiple classes defined in a single file, for ease of management.

All classes will be loaded when the corresponding component loads.

If you plan to share the classes between components, it is recommended to put the class definition into a shared library within MidCOM which is loaded by the corresponding components.

With some components it could make sense to keep the defined classes with a full-blown component, for example if you build a component like n.n.orders which provides ways to "remote control" it.

How does MidCOM detect these "defined classes" then?

These classes will be declared in a part of the component's interface. When the component loader starts up a component, it will read the list of defined classes and invoke the class loader/builder for each of them. This is, by the way, the same place where ACL permissions are defined, so we need extensions at this point anyway.

With the new component baseclass it should be as easy as adding something like this to the interface class constructor:

$this->_autoload_dbclasses = Array('myclassdef1', 'myclassdef2');

... where the class definitions are looked up in the config directory of the component.

These files are largely structured like MidCOM schema databases, being an Array definition. For performance reasons the class loader/builder will support multiple class definitions within a single definition file. So a full class definition file could look like this:

--- START OF FILE ---
Array
(
    'table' => 'article',
    'old_class_name' => 'MidgardArticle',
    'new_class_name' => 'NewMidgardArticle',
    'midcom_class_name' => 'MidCOMArticle'
),
Array
(
    'table' => 'topic',
    'old_class_name' => 'MidgardTopic',
    'new_class_name' => 'NewMidgardTopic',
    'midcom_class_name' => 'MidCOMTopic'
)
--- END OF FILE ---

As you can see, what is defined here is essentially an array of arrays. The main difference from the MidCOM schema databases is that the arrays are not indexed explicitly.

The main advantage here lies in faster invalidation checks: the class builder/loader will generate a single PHP source file per class definition file rather than one file per class.
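Reading such a file could look roughly like the sketch below. It assumes that the file body is wrapped in an outer Array(...) and evaluated, which is how a bare list of arrays can become one PHP value; the function name sketch_parse_definitions is invented, and the file contents are inlined so the example is self-contained.

```php
<?php
// Contents of a definition file, inlined here for the sake of the sketch.
$contents = <<<'EOT'
Array
(
    'table' => 'article',
    'old_class_name' => 'MidgardArticle',
    'new_class_name' => 'NewMidgardArticle',
    'midcom_class_name' => 'MidCOMArticle'
),
Array
(
    'table' => 'topic',
    'old_class_name' => 'MidgardTopic',
    'new_class_name' => 'NewMidgardTopic',
    'midcom_class_name' => 'MidCOMTopic'
)
EOT;

// Assumption: the body is wrapped in an outer Array(...) and evaluated.
function sketch_parse_definitions($contents)
{
    $definitions = array();
    eval("\$definitions = Array(\n{$contents}\n);");
    return $definitions;
}

$definitions = sketch_parse_definitions($contents);

// One generated source file would then hold all classes from this list.
foreach ($definitions as $definition) {
    $class_names[] = '__' . $definition['midcom_class_name'];
}
```

Since the inner arrays carry no explicit keys, PHP indexes them 0, 1, ... in file order, and the builder can simply loop over the list.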

Summary

While this might look like overkill at first sight, I would like to repeat a few reasons why I propose such a piece of code for MidCOM:

First, it will finally bring all operations within MidCOM's grasp. This is an investment in the future and is used only to a very limited extent right now, but it finally allows me to influence and control every object at runtime. In the long run, this will make many things easier.

Second, it will make building transition code easier, as you only have to modify a single place instead of over a dozen classes (perhaps more with OpenPSA 2). This is interesting for both the MgdSchema and the Multilang transitions within MidCOM.

Third, these cached classes are the only way to keep both performance and the advantages of object inheritance. The latter aspect is mainly about the human factor: wherever you work with non-MidCOM objects right now, you have to remember to call the MidCOM hooks yourself. This can be avoided in the future, for example by making the frequent invalidate() calls obsolete.