UTF-8 mode with MidCOM

New sites should use UTF-8 mode only. In my opinion this is the only way to go nowadays, as it gets rid of all the charset troubles. Several new features in the core (for example the Mail Template stuff) also work best with UTF-8, which saves you a lot of conversions and potential trouble.

Putting a MidCOM site into UTF-8 mode requires several changes across the complete system. The setup summarized here works and currently drives this website, but it is by no means thoroughly tested. Comments welcome.

Apache-level settings

The majority of the configuration work is done here.

Midgard-PHP Module settings

Midgard's default parser, the so-called latin1 parser, automatically converts characters it does not regard as valid ISO-8859-1 into HTML entities at several points, roughly analogous to the PHP function htmlentities(). Of course, this runs into great trouble with UTF-8, as most multi-byte UTF-8 sequences get corrupted by this replacement.

The russian parser mode is the alternative here. Despite its name, it has nothing to do with the characters used in that country; instead, the replacement described above is reduced to something equivalent to htmlspecialchars(), which leaves UTF-8 sequences intact:

MidgardParser russian

This will also deliver a new content type header, which includes a charset=utf-8 hint.
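To make the difference between the two parser modes more tangible, here is a small sketch in plain PHP. It does not touch the Midgard parser itself; it only assumes, as described above, that the latin1 mode behaves roughly like htmlentities() applied to ISO-8859-1 input and the russian mode roughly like htmlspecialchars(). The variable names are made up for this illustration.

```php
<?php
// "ä" encoded as UTF-8 consists of the two bytes 0xC3 0xA4.
$utf8 = "\xC3\xA4";

// latin1-style replacement (assumption: comparable to htmlentities()):
// each byte is read as one ISO-8859-1 character and turned into its own
// entity, so the original character is destroyed.
$latin1_style = htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1');

// russian-style replacement (assumption: comparable to htmlspecialchars()):
// only the HTML special characters are escaped, the UTF-8 bytes pass through.
$russian_style = htmlspecialchars($utf8, ENT_QUOTES, 'ISO-8859-1');

echo $latin1_style, "\n";           // &Atilde;&curren;
echo bin2hex($russian_style), "\n"; // c3a4
```

The first line of output is what a browser would render as two unrelated characters instead of one "ä", which is the kind of corruption the latin1 mode causes on UTF-8 content.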

PHP settings

PHP, too, needs a couple of settings to work correctly with UTF-8 encoded text. The most important ones are the multibyte string function overloads, which replace the standard string functions with their mb_ counterparts.

php_value default_charset UTF-8
php_value mbstring.func_overload 7
php_value mbstring.internal_encoding UTF-8
php_value mbstring.detect_order UTF-8

This ensures that you can work the way you are used to while using UTF-8 as the charset.
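Why the overloads matter can be seen with a two-character string. The sketch below is meant to be run without the overload active (for example from the command line) and calls the mb_ functions explicitly, so that both behaviours are visible side by side. It assumes only that the mbstring extension is loaded; the variable names are illustrative.

```php
<?php
// "äö" in UTF-8: two characters, four bytes.
$text = "\xC3\xA4\xC3\xB6";

// Byte-oriented functions, i.e. what you get without the overload.
$byte_length = strlen($text);        // counts bytes
$byte_cut    = substr($text, 0, 1);  // cuts "ä" in half

// Multibyte-aware counterparts, i.e. what the overload maps the calls to.
$char_length = mb_strlen($text, 'UTF-8');       // counts characters
$char_cut    = mb_substr($text, 0, 1, 'UTF-8'); // returns a complete "ä"

echo $byte_length, " bytes, ", $char_length, " characters\n"; // 4 bytes, 2 characters
echo bin2hex($byte_cut), " vs. ", bin2hex($char_cut), "\n";   // c3 vs. c3a4
```

The half character produced by the byte-oriented substr() is exactly the sort of damage that is hard to trace back once it has ended up in stored content.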

Important Note: Apparently, these settings break the MetaWebLog API interface to the newsticker, which does not work in UTF-8 mode for reasons unknown to me. As long as you do not use this feature, do not disable these lines (see Bergie's Blog). It might even be viable to turn these values off only for the MetaWebLog interface using Apache Location magic.

Disabling these features will break many string operations in MidCOM in a difficult-to-track manner.
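The per-location idea from the note above might look roughly like the following. This is an untested sketch: the path /metaweblog is a placeholder and not the real location of the MetaWebLog interface, and whether PHP accepts such a per-location override of mbstring.func_overload depends on your PHP version.

```apache
# Untested sketch. "/metaweblog" is a hypothetical path; replace it with
# the URL under which the MetaWebLog interface is reachable on your site.
<Location /metaweblog>
    # Switch the string function overloads off for this location only.
    php_value mbstring.func_overload 0
</Location>
```

Everything outside this location keeps the overloads from the listing above, so the rest of the MidCOM site should stay unaffected.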

Apache settings

Finally, Apache itself should get a couple of mod_charset directives, so that static files are served as UTF-8 as well:


CharsetDefault utf8
CharsetSourceEnc utf8
CharsetDisable off
AddDefaultCharset UTF-8

Midgard Components settings

First, and most important: if you still use MidCOM 1.3.x, you should upgrade to the latest 1.4.0 technology preview, as it contains a couple of changes that are not in the 1.3 branch and are important for UTF-8 operation.

While the above settings are enough to get a regular Midgard application like Asgard running correctly, a MidCOM site needs a few extra settings, so that things like the caching engine or the l10n service can create output in the right encoding. Let us take a look at all available settings:

// $midcom_cachemultilang = false;
mgd_include_snippet("/midcom/midcom");
$i18n =& $midcom->get_service("i18n");
// $i18n->set_language("de");
$i18n->set_charset("utf-8");

The three lines that are not commented out are the bare minimum: after setting up the MidCOM environment, you get a handle to the central i18n service and force it to use UTF-8 instead of the default charset associated with the client's detected language (usually determined through HTTP negotiation).

The full example shows how to force a website into a single language, which is useful for actual websites but not for AIS. You might need this because the default component styles are now mostly localized; without it, a Finnish visitor to an English website, for example, finds a couple of Finnish texts spread throughout the various pages. It is important to call the set_charset method after the set_language method, as set_language resets the charset to the default one associated with the new language. Finally, the first statement tells the caching engine not to distinguish between different language versions in the cache. If you lock the i18n core to a given language, you should also set this option (before initializing MidCOM!), as this greatly improves the cache's effectiveness.
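The ordering rule for set_language and set_charset can be illustrated outside of Midgard with a small stand-in. The class below is hypothetical and is not the real MidCOM i18n service: it models only the one behaviour described above (set_language resets the charset), and both the per-language default charsets and the get_charset helper are made up for the illustration.

```php
<?php
// Hypothetical stand-in for the MidCOM i18n service, NOT the real class.
// It models a single behaviour from the text: set_language() resets the
// charset to the default of the newly selected language.
class ExampleI18n
{
    // Assumed per-language default charsets, invented for this sketch.
    private $defaults = array('en' => 'iso-8859-1', 'de' => 'iso-8859-1');
    private $charset  = 'iso-8859-1';

    public function set_language($language)
    {
        $this->charset = $this->defaults[$language];
    }

    public function set_charset($charset)
    {
        $this->charset = strtolower($charset);
    }

    // Illustrative helper so the result can be inspected.
    public function get_charset()
    {
        return $this->charset;
    }
}

// Wrong order: the later set_language() call overwrites the charset.
$wrong = new ExampleI18n();
$wrong->set_charset('utf-8');
$wrong->set_language('de');

// Right order: language first, charset last.
$right = new ExampleI18n();
$right->set_language('de');
$right->set_charset('utf-8');

echo $wrong->get_charset(), "\n"; // iso-8859-1
echo $right->get_charset(), "\n"; // utf-8
```

On a real site the same order applies to the listing above: uncomment the set_language line and leave the set_charset call as the last statement.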