COMPOUND_TEXT versus UTF8_STRING

Thu Sep 23 01:52:20 PDT 2004

Keith Packard wrote on 2004-09-22 18:57 UTC:
> I'd rather just allow UTF-8 text in the existing ICCCM properties, but I 
> think that doesn't provide a reasonable transition strategy.

There is certainly no harm in having the next ICCCM version encourage
the recipients of all properties where STRING and COMPOUND_TEXT are
allowed today to accept UTF8_STRING as well.

In addition, there is little harm done in using UTF8_STRING whenever the
text to be transmitted contains at least one character for which STRING
and COMPOUND_TEXT provide no encoding (think of Ethiopian or Vietnamese
window titles). Unless the recipient understands UTF-8 (and therefore
probably already implements UTF8_STRING), the data will be meaningless
to it anyway.

I also wouldn't deprecate STRING quite yet. Instead, if the original
property text contains only characters in the range U+0000 to U+00FF,
the source of the property should still choose STRING (ISO 8859-1) for
quite some time; otherwise it should choose UTF8_STRING. ISO 8859-1 is
cheaper to support than any other legacy encoding, because it can be
converted to Unicode without any conversion table.
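
To illustrate why ISO 8859-1 needs no table, here is a minimal sketch in
C. The names choose_text_type() and latin1_to_utf8() and the enum are
made up for this example and are not part of Xlib or the ICCCM. The
sketch also ignores the fact that STRING excludes most control
characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: which ICCCM text type a sender would pick. */
enum text_type { TEXT_STRING, TEXT_UTF8_STRING };

/* Pick STRING if every code point fits into ISO 8859-1 (U+0000..U+00FF),
 * otherwise UTF8_STRING. Control characters are not filtered here. */
enum text_type choose_text_type(const uint32_t *cp, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cp[i] > 0xFF)
            return TEXT_UTF8_STRING;
    return TEXT_STRING;
}

/* Convert n bytes of ISO 8859-1 to UTF-8 without any lookup table:
 * each input byte value is already its Unicode code point.
 * The caller must provide room for up to 2*n bytes in out.
 * Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[o++] = c;                                  /* ASCII: 1 byte */
        } else {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));   /* lead byte */
            out[o++] = (unsigned char)(0x80 | (c & 0x3F)); /* continuation */
        }
    }
    return o;
}
```

A sender would call choose_text_type() on the decoded title and only
fall back to UTF8_STRING when a code point above U+00FF appears. A
recipient that works in Unicode internally needs nothing more than the
loop in latin1_to_utf8() to accept STRING.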

Our long-term architectural goal should be a world in which we can do
without the many megabytes of conversion tables that almost every
text-communication library carries around separately at present.
Conversion tables should be only a temporary hack, not a long-term
feature.

Another helpful heuristic might be that any client that runs in a UTF-8
locale (i.e., strcmp(nl_langinfo(CODESET), "UTF-8") == 0 on any
POSIX:2001 system) is most likely dealing with other clients that
already understand UTF8_STRING, and can therefore dare to use
UTF8_STRING in its properties. COMPOUND_TEXT would then only be used by
clients running in non-UTF-8 locales. This of course neglects that in a
distributed heterogeneous system clients might run under many different
locales, but in practice it might be good enough in well over 95% of
all cases.
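
The locale heuristic above could be sketched in C roughly as follows.
The helper names locale_is_utf8() and preferred_text_type() are
hypothetical. Only setlocale(), nl_langinfo() and strcmp() are real
POSIX/C calls. The sketch assumes the client has already called
setlocale(LC_CTYPE, "") at startup, since otherwise the "C" locale is
in effect.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Return 1 if the codeset of the current LC_CTYPE locale is UTF-8.
 * The program is expected to have called setlocale(LC_CTYPE, "")
 * beforehand; without that call the "C" locale applies. */
int locale_is_utf8(void)
{
    const char *codeset = nl_langinfo(CODESET);

    return codeset != NULL && strcmp(codeset, "UTF-8") == 0;
}

/* Hypothetical helper: name of the property type a client would use
 * under the heuristic (UTF-8 locale => UTF8_STRING, else COMPOUND_TEXT). */
const char *preferred_text_type(void)
{
    return locale_is_utf8() ? "UTF8_STRING" : "COMPOUND_TEXT";
}
```

A client would then pass the returned name to XInternAtom() to obtain
the type atom for its property. Note that the test only looks at the
sender's own locale, which is exactly the limitation mentioned above.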

> One strategy which might work is to have the window manager place a property on
> the  root window indicating support for UTF-8 encoded strings in window 
> properties, but even that seems problematic for other applications using 
> those strings.

Having the window manager enumerate in a root window property the list
of encodings it can handle is also a very nice approach. In practice,
are there many applications other than window managers that use these
properties? And which of those are not typically upgraded *together*
with the window manager (e.g., as part of a toolkit that also includes
the window manager)?
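
On the client side, using such an advertised list could look roughly
like the following C sketch. It assumes the client has already read the
(so far purely hypothetical) root window property and turned its atoms
into names, e.g. with XGetAtomName(). The functions is_advertised() and
pick_text_type() are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if name occurs in the list of type names that the window
 * manager advertised, 0 otherwise. */
int is_advertised(const char *const *types, size_t n, const char *name)
{
    assert(types != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (strcmp(types[i], name) == 0)
            return 1;
    return 0;
}

/* Prefer UTF8_STRING when the window manager advertises it; otherwise
 * fall back to COMPOUND_TEXT, which existing recipients already handle.
 * An empty list (no such property on the root window) also falls back. */
const char *pick_text_type(const char *const *types, size_t n)
{
    return is_advertised(types, n, "UTF8_STRING") ? "UTF8_STRING"
                                                  : "COMPOUND_TEXT";
}
```

The fallback branch is what keeps this safe with an old window manager
that sets no such property at all.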

Introducing entirely new properties is of course a safe way of doing
things, but it smells very much like classical design by committee:
robust against backwards-compatibility concerns, but overall rather
ugly (especially with leading underscores!).

I think it would be nice if we could eventually get UTF8_STRING used in
the original properties, as the ICCCM authors had originally intended.

After all, remember that not too long ago, COMPOUND_TEXT was simply
added to the ICCCM as well, without any new properties for backwards
compatibility.

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__