Ethereal-dev: Re: [Ethereal-dev] While we're on the subject of new frametypes...
On Fri, Dec 13, 2002 at 04:22:39PM +1100, Tim Potter wrote:
> How about a new frametype for unicode strings?
Big-endian, or little-endian? You could tell "proto_tree_add_item()"
what the byte order is; "proto_tree_add_ustring()", however, would
probably need to take a byte order argument.
There's currently a commented-out FT_UCS2_LE in "epan/ftypes/ftypes.h",
for 2-byte little-endian Unicode. We could perhaps implement that.
However, I think there are some things we should think about before
doing Unicode (even if we don't come to a conclusion on all of them
first - we might be able to temporarily punt on the display and
printing issues by discarding non-ASCII characters, or by
printing/displaying them as escape sequences, so those issues may not
require immediate resolution):
1) What should we do about other extended-ASCII character sets?
Currently, we don't do anything clever, which means that, for
example, ISO 8859/1 strings might work OK if you're running
on some UNIX flavor with the locale set to an 8859/1 locale,
but won't work in other locales.
Should we make them Unicode strings, and have the dissector
translate them from the character set in question to Unicode?
Making the character set a property of the field might not
work - for example, that wouldn't work for OEM character sets
in SMB, as that'd have to be something set by an SMB
preference item at run time. It might work for the Mac
character set in Appletalk, however.
2) As long as we're going down that path, should we store *all*
strings as Unicode in the protocol tree, and just keep the
existing FT_STRING types, and:
- perhaps have the byte-order argument to
"proto_tree_add_item()" specify, for FT_STRING types,
the character set and, in cases where a multi-byte
character type can come in either byte order, the byte
order;
- add a character set+byte order argument to
"proto_tree_add_string()"?
That complicates life for GTK+ 1.2[.x], as you have to figure
out what character encoding is being used for the font, and
translate into that. However, GTK+ 2.x, and the Win32 GTK+
1.3[.x], use UTF-8, so we should be able to make that work
reasonably well. Doing so *might* fix *some* of the problems
people are reporting on Windows.
Recent versions of Qt use Unicode or UTF-8, so a KDE version
should be able to handle that, if we do one.
I don't know offhand what Aqua uses, but I wouldn't be
surprised if you could get it to use Unicode or UTF-8.
You can use Unicode for applications running on Windows NT
(NT 4.0, 2K, XP, .NET Server), so any native Windows GUI (or
Packetyzer) should be able to make that work. Windows OT
(95, 98, Me) is another matter; there is the "Microsoft Layer
for Unicode on Windows 95/98/Me Systems":
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/win9x/unilayer_4wj7.asp
which might help - however, that *might* also affect non-GUI
APIs, causing them to use Unicode as well. If so, we'd have
to deal with that somehow.
Text output gets tricky. On Windows, if you do a "print to
file" in Network Monitor 2.0, it prints out a Unicode text
file (which is a bit annoying if I wanted an ASCII text file,
although "tr"ing it on UNIX can end that annoyance by
stripping out the extra null bytes). We could, I guess, do
that on Windows for Tethereal and printing, although we might
have to further Windowsify the printing code to make that
work right.
On UNIX, if we can find some way to translate from Unicode or
UTF-8 to the locale's character set, we could do that for
Tethereal and printing. The iconv library *might* handle that,
although that'd require the native iconv library to handle UTF-8
or Unicode, and I'm not sure all of them do - I seem to remember
some version of Solaris having a special add-on developer's pack
to add UTF-8 support, so earlier Solaris versions might not
handle it, although I think Solaris 8 handles it natively.
Failing that, we'd have to require GNU iconv on platforms whose
native iconv can't handle Unicode or UTF-8.
> Currently they can
> either be displayed as a normal string in which case you get the first
> character, or as a bunch of bytes which isn't very attractive.
Or you could de-Unicodeize them and use FT_STRING-family types, which is
better than a poke in the eye with a sharp stick, but doesn't handle
non-ASCII characters. I think we do that in some places.