Ethereal-dev: Re: [Ethereal-dev] Unicode strings ...
On Sat, Aug 11, 2001 at 08:39:37PM -0400, Ed Warnicke wrote:
> Hmm... do you mean strings of wide characters or some other encoding (
> utf-8?).
There are (at least) two questions that come up when adding to the
protocol tree a character string whose contents come from a packet:
1) In what form do we store characters in the protocol tree?
2) In what form is the character string in the packet?
In the packet, the character string might be:

    ASCII (one byte per character, and no byte has the 8th bit
    set - a/k/a ISO 646 IRV:1991);

    other national ISO 646 variants (again, one byte per character,
    and no byte has the 8th bit set; some code points have glyphs
    other than the ones for ASCII, e.g. for accented letters and
    national currency symbols - see
    http://www.terena.nl/multiling/euroml/section04.html);

    ISO 8859/n character sets (one byte per character, and bytes
    *can* have the 8th bit set; 8-bit supersets of ASCII);

    various encodings of 2-byte (non-Unicode) character sets;

    various IBM PC/DOS/Windows code pages (some are 8-bit supersets
    of ASCII, some are, I think, encodings of 2-byte non-Unicode
    character sets);

    MacOS character sets;

    EBCDIC;

    big-endian 2-byte Unicode;

    little-endian 2-byte Unicode;

    perhaps big-endian and little-endian 4-byte ISO 10646;

    UTF-8;

    other encodings of ISO 10646;

and so on.
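To make the variety above concrete, here is a rough sketch (in Python
rather than C, purely for brevity) of turning the same string from
several packet encodings into one common form. The table and the
function name `packet_string_to_utf8` are made up for this
illustration, as are the charset labels on the left; the codec names on
the right are ones Python's standard library provides.

```python
# Hypothetical mapping from a packet character set label to a Python codec.
# The labels on the left are invented for this sketch.
PACKET_CODECS = {
    "ascii": "ascii",
    "iso-8859-1": "latin-1",
    "cp437": "cp437",           # an IBM PC/DOS code page
    "mac-roman": "mac_roman",   # a classic MacOS character set
    "ebcdic": "cp037",          # one of several EBCDIC code pages
    "unicode-be": "utf-16-be",  # a superset of big-endian 2-byte Unicode
    "unicode-le": "utf-16-le",  # a superset of little-endian 2-byte Unicode
    "utf-8": "utf-8",
}


def packet_string_to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode raw packet bytes in the named charset and re-encode as UTF-8."""
    text = raw.decode(PACKET_CODECS[charset])
    return text.encode("utf-8")


# The same four characters as they might appear in three different packets.
samples = [
    (b"caf\xe9", "iso-8859-1"),
    (b"\x00c\x00a\x00f\x00\xe9", "unicode-be"),
    (b"c\x00a\x00f\x00\xe9\x00", "unicode-le"),
]
for raw, charset in samples:
    # Every sample ends up as the same UTF-8 byte string.
    print(charset, packet_string_to_utf8(raw, charset))
```

Note that nothing in the bytes themselves says which row of the table
applies; the dissector (or the user) has to supply that.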
A question then would be how strings in the various non-ASCII character
sets should be stored internally in the protocol tree.
We could store them "as is". However, this may cause problems when
displaying them, as the encoding in the packet may not match the
encoding of the font or font set used by GTK+ in the user's locale, of
whatever tool you're using to view the printed output of Tethereal or
the output of "print to file", or of your printing software and your
printer. It would also make entering display filter expressions that
compare strings tricky, as you'd have to enter them in the "right"
character set.
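A small sketch of both problems with "as is" storage, again in Python
for brevity. The helper `naive_filter_match` is hypothetical and only
stands in for a byte-for-byte string comparison; it is not how
Ethereal's display filters are implemented.

```python
raw = b"\xe9"  # one byte taken "as is" from a packet

# The same byte is a different character depending on the assumed encoding.
as_latin1 = raw.decode("latin-1")      # ISO 8859-1
as_cp437 = raw.decode("cp437")         # IBM PC code page 437
as_cyrillic = raw.decode("iso8859-5")  # ISO 8859-5
print(as_latin1, as_cp437, as_cyrillic)


def naive_filter_match(field: bytes, typed: str, input_codec: str) -> bool:
    """Compare stored packet bytes with filter text encoded as it was typed."""
    return field == typed.encode(input_codec)


field = b"caf\xe9"  # an ISO 8859-1 string stored "as is"

# Matches only if the filter text happens to be entered in the same encoding.
print(naive_filter_match(field, "caf\u00e9", "latin-1"))
print(naive_filter_match(field, "caf\u00e9", "utf-8"))
```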
We could store them in, say, 2-byte Unicode (big-endian, little-endian,
or whichever of those is the host's byte order), or as UTF-8. That
would work well with GTK+ 2.0 which, as I remember, will use UTF-8
internally; however, it'd be a bit of a mess with GTK+ 1.x (not that
GTK+ 1.x always works well with "as is", either). And whilst it may
simplify the issues for printing to a file or printer, as you only have
to worry about the output encoding, you do still have to worry about
that. Display filter string comparisons are affected similarly: you'd
only have to worry about the encoding in which the expression was
entered.
Storing strings as Unicode may also cause problems if there are
characters in the packet's character set that aren't in ISO 10646, and
it requires that Ethereal know the character set in the packet (which
may require that it be told the character set, e.g. for SMB clients
using a code page rather than Unicode in strings).
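As an illustration of the "has to be told the character set" case, a
sketch in Python: the function `decode_packet_string` and its fallback
policy are assumptions made for this example, not anything Ethereal
does. When no character set is known it falls back to ASCII and marks
every byte it can't interpret with U+FFFD instead of guessing.

```python
from typing import Optional


def decode_packet_string(raw: bytes, codec: Optional[str]) -> str:
    """Decode packet bytes; codec=None means the character set is unknown."""
    if codec is None:
        # Assumed fallback for this sketch: only ASCII can be trusted.
        codec = "ascii"
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
    return raw.decode(codec, errors="replace")


raw = "caf\u00e9".encode("cp850")  # e.g. a string from a code-page SMB client

# Told the code page (say, via a preference), the string comes out intact.
print(decode_packet_string(raw, "cp850"))

# Not told, the non-ASCII byte can only be flagged as unknown.
print(decode_packet_string(raw, None))
```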
(It also raises the question of doing comparisons other than equality
and inequality comparisons - but that brings up the issue of dictionary
comparisons, which differ from culture to culture....)
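To show why ordering comparisons are the awkward case, a short Python
sketch: comparing by code point (or by UTF-8 bytes, which gives the
same order) is well defined, but it isn't dictionary order in any
language.

```python
words = ["zebra", "\u00c4pfel", "apple"]  # "\u00c4pfel" starts with A-umlaut

# Ordering by Unicode code point.
by_code_point = sorted(words)

# Ordering by the UTF-8 encoded bytes; this preserves code point order.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# Both put the A-umlaut word after "zebra", which no dictionary would do.
print(by_code_point)
print(by_utf8_bytes)
```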
My inclination would be to go with UTF-8 internally, if we can figure
out:

    how to translate from packet character sets to UTF-8;

    how to translate from UTF-8 to the "display" character set when

        displaying the protocol tree in GTK+ 1.x (GTK+ 2.0 should, I
        think, not be a problem when it comes out, but it's still under
        development *and* not all systems will come with it by default
        for a while, I suspect);

        printing the protocol tree to a file or printer;

    and how to translate from the character set in text widgets (and on
    the command line) to UTF-8 (GTK+ 2.0 will presumably use UTF-8, but
    GTK+ 1.x, and the command line, are a different matter).
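The two translations at the edges could look roughly like this, again
sketched in Python rather than C. The names `utf8_to_display` and
`input_to_utf8` are hypothetical, and replacing characters the display
character set lacks with "?" is just one possible policy.

```python
def utf8_to_display(utf8: bytes, display_codec: str) -> bytes:
    """Translate an internal UTF-8 string to the display/print encoding."""
    text = utf8.decode("utf-8")
    # Characters missing from the display character set become "?".
    return text.encode(display_codec, errors="replace")


def input_to_utf8(typed: bytes, input_codec: str) -> bytes:
    """Translate text from a text widget or the command line to UTF-8."""
    return typed.decode(input_codec).encode("utf-8")


internal = "caf\u00e9 \u0398".encode("utf-8")  # stored in the protocol tree

# ISO 8859-1 has the accented e but not the Greek theta.
print(utf8_to_display(internal, "latin-1"))

# A filter string typed in an ISO 8859-1 locale, converted before comparing.
print(input_to_utf8(b"caf\xe9", "latin-1"))
```

With both in place, string comparison in display filters can be done
entirely on the UTF-8 form.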
This would obviate the need for separate FT_ types for Unicode strings
(as they'd be stored internally as UTF-8).