[RFC PATCH:app/xprop] Print UTF8_STRING type as UTF-8 when locale supports it
cloos at jhcloos.com
Sun Oct 18 22:56:49 PDT 2009
>>>>> "Yang" == Yang Zhao <yang at yangman.ca> writes:
Yang> Currently, when an invalid UTF-8 string is detected, an error message is printed
Yang> instead of the string value. I don't think this is ideal. What would be better?
You are correct that printing "<Not a valid UTF-8 string>" is not ideal.
If the utf8 string is invalid, I'd print out the same output that the
existing version of xprop(1) prints. Or perhaps print out the valid
parts with backslash-escaped data for the invalid parts.
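To make the second option concrete, here is a rough sketch (not taken from the patch) of copying the valid sequences through and escaping each invalid byte. The names utf8_seq_len() and escape_invalid_utf8() are made up for illustration, and the \ooo octal escape format is an assumption; the patch may well prefer a different one.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence
 * starting at s, or 0 if the bytes there are not well formed.
 * Rejects overlong forms, surrogates and anything above 0x10FFFF. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;                     /* stray continuation or bad lead byte */

    if (len > n) return 0;             /* sequence runs off the end */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

/* Hypothetical helper: copy in[0..n) to out, passing valid UTF-8 through
 * unchanged and writing each invalid byte as a four-character \ooo octal
 * escape.  Output is NUL-terminated and truncated if out is too small.
 * Returns the number of bytes written, not counting the NUL. */
static size_t escape_invalid_utf8(const unsigned char *in, size_t n,
                                  char *out, size_t outsz)
{
    size_t i = 0, o = 0;

    if (outsz == 0) return 0;
    while (i < n) {
        size_t len = utf8_seq_len(in + i, n - i);
        if (len > 0) {
            if (o + len >= outsz) break;
            memcpy(out + o, in + i, len);
            o += len;
            i += len;
        } else {
            if (o + 4 >= outsz) break;
            snprintf(out + o, 5, "\\%03o", (unsigned)in[i]);
            o += 4;
            i += 1;
        }
    }
    out[o] = '\0';
    return o;
}
```

With this sketch the bytes 'a', 0xFF, 'b' come out as a\377b, and an overlong C0/80 comes out as \300\200. A literal backslash in the input is passed through untouched, so the output is not unambiguous; whether that matters is part of the same policy question.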
Also, it would be good to add a comment explaining the logic used by the
is_valid_utf8() function, notably including a specification of what it
is verifying. (I.e., is it just verifying the basic progression of:
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
or does it confirm that the codepoint is in [0x0000,0x10FFFF],
and/or not a surrogate pair, and/or not a dangling surrogate,
and/or that ((cp & 0xffff) == 0xffff || (cp & 0xffff) == 0xfffe)
is false, and/or that the utf8 is shortest form. You get the idea.)
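To illustrate how those checks differ, here is a sketch of a classifier that reports which check a sequence fails, so that each one can be accepted or rejected by policy. This is not the is_valid_utf8() from the patch; the enum and the name utf8_classify() are made up for the example. It deliberately parses the old 5- and 6-byte lead bytes so that out-of-range values can be told apart from malformed ones.

```c
#include <stddef.h>

/* Hypothetical result codes, one per check discussed above. */
enum utf8_problem {
    UTF8_OK = 0,
    UTF8_BAD_LEAD,          /* stray 10xxxxxx, or 0xFE/0xFF */
    UTF8_TRUNCATED,         /* input ends inside a sequence */
    UTF8_BAD_CONTINUATION,  /* expected 10xxxxxx, got something else */
    UTF8_OVERLONG,          /* not the shortest form */
    UTF8_OUT_OF_RANGE,      /* code point above 0x10FFFF */
    UTF8_SURROGATE,         /* code point in 0xD800..0xDFFF */
    UTF8_NONCHARACTER       /* low 16 bits are 0xFFFE or 0xFFFF */
};

/* Classify the sequence starting at s.  *used is set to the number of
 * bytes examined as one unit (1 when the lead byte itself is bad). */
static enum utf8_problem
utf8_classify(const unsigned char *s, size_t n, size_t *used)
{
    /* Smallest code point that legitimately needs a given length. */
    static const unsigned long min_cp[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    unsigned long cp;
    size_t len, i;

    *used = 0;
    if (n == 0) return UTF8_TRUNCATED;
    *used = 1;

    if (s[0] < 0x80) return UTF8_OK;
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else if ((s[0] & 0xFC) == 0xF8) { len = 5; cp = s[0] & 0x03; }
    else if ((s[0] & 0xFE) == 0xFC) { len = 6; cp = s[0] & 0x01; }
    else return UTF8_BAD_LEAD;

    if (len > n) return UTF8_TRUNCATED;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return UTF8_BAD_CONTINUATION;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *used = len;

    if (cp < min_cp[len])             return UTF8_OVERLONG;
    if (cp > 0x10FFFF)                return UTF8_OUT_OF_RANGE;
    if (cp >= 0xD800 && cp <= 0xDFFF) return UTF8_SURROGATE;
    if ((cp & 0xFFFE) == 0xFFFE)      return UTF8_NONCHARACTER;
    return UTF8_OK;
}
```

A caller can then treat, say, UTF8_OVERLONG as always fatal while letting policy decide on UTF8_SURROGATE or UTF8_OUT_OF_RANGE. The noncharacter test covers only the 0xFFFE/0xFFFF pattern named above.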
Other than that, I'd Sign-Off on it.
We do need to make a policy decision on how strict the utf8 check
should be. Outputting a dangling surrogate is unwise, but whereas
a non-BMP character should be encoded directly to utf8, and whereas
a maximally strict utf8 decoder should reject any surrogate code point,
it is not so obvious whether a program like xprop(1) should reject utf8-
encoded surrogate pairs, recognise them and output the proper utf8 for
the resulting code point, or just output the pair and trust the
next stage of the pipe to handle them. A similar argument can be made
for code points beyond the [0x0000,0x10FFFF] range which utf16 can handle.
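For the "recognise them" option, the arithmetic is small. The following is only a sketch, with the made-up names combine_surrogates() and encode_utf8_4(); it assumes the two surrogate values have already been decoded from their 3-byte sequences.

```c
/* Combine a high (0xD800..0xDBFF) and low (0xDC00..0xDFFF) surrogate
 * into one code point in 0x10000..0x10FFFF.  Returns 0 if the two
 * values do not form a valid pair (0 is never a valid result). */
static unsigned long combine_surrogates(unsigned hi, unsigned lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000UL
         + (((unsigned long)(hi - 0xD800) << 10) | (lo - 0xDC00));
}

/* Encode a code point in 0x10000..0x10FFFF as four UTF-8 bytes.
 * The caller is responsible for the range check. */
static void encode_utf8_4(unsigned long cp, unsigned char out[4])
{
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
}
```

For example, the pair 0xD83D, 0xDE00 combines to 0x1F600, which encodes as F0/9F/98/80. A high surrogate not followed by a low one is the dangling case and still needs its own policy.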
Code points like 0xfffe, 0xffff, 0x1fffe, 0x1ffff, 0x2fffe, 0x2ffff, ...
should not be output, though. Also, non-shortest-form utf8 like C0/80,
E0/80/80, F0/80/80/80, F8/80/80/80/80, FC/80/80/80/80/80 (all of which
are the same codepoint as (char)0x00) can cause havoc.
Some people in the wild will claim that allowing any utf8 beyond those
permitted by the most strict, limited and limiting TUS-specified version
of the utf-8 specification is inherently a security hole and may file
bug reports if the verification does not match their ideas. Nonetheless,
that is probably more limiting than xprop(1) needs to be.
James Cloos <cloos at jhcloos.com> OpenPGP: 1024D/ED7DAEA6