Wireshark · Ethereal-dev: Re: [Ethereal-dev] Unicode strings ...

Note: This archive is from the project's previous web site, ethereal.com. This list is no longer active.

From: Guy Harris <gharris@xxxxxxxxx>

Date: Sun, 12 Aug 2001 17:12:34 -0700

On Sat, Aug 11, 2001 at 08:39:37PM -0400, Ed Warnicke wrote:
> Hmm... do you mean strings of wide characters or some other encoding (
> utf-8?).

There are (at least) two questions that come up when adding to the
protocol tree a character string whose contents come from a packet:

	1) In what form do we store characters in the protocol tree?

	2) In what form is the character string in the packet?

In the packet, the character string might be:

	ASCII (one byte per character, and no byte has the 8th bit
	set - a/k/a ISO 646 IRV:1991);

	other national ISO 646 variants (again, one byte per character,
	and no byte has the 8th bit set; some code points have glyphs
	other than the ones for ASCII, e.g. for accented letters and
	national currency symbols - see

		http://www.terena.nl/multiling/euroml/section04.html

	);

	ISO 8859/n character sets (one byte per character, and bytes
	*can* have the 8th bit set; 8-bit supersets of ASCII);

	various encodings of 2-byte (non-Unicode) character sets;

	various IBM PC/DOS/Windows code pages (some are 8-bit supersets
	of ASCII, some are, I think, encodings of 2-byte non-Unicode
	character sets);

	MacOS character sets;

	EBCDIC;

	big-endian 2-byte Unicode;

	little-endian 2-byte Unicode;

	perhaps big-endian and little-endian 4-byte ISO 10646;

	UTF-8;

	other encodings of ISO 10646;

	and so on.
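
To make the list above concrete, here is a minimal Python sketch that
prints the bytes one short string would occupy in a packet under a few of
those encodings.  The Python codec names are stand-ins chosen for
illustration, and the helper name wire_forms is hypothetical; none of this
is Ethereal code.

```python
# Sketch only: Python codec names stand in for a few of the encodings
# listed above; wire_forms is a hypothetical helper, not Ethereal code.
SAMPLE_TEXT = "café"

PACKET_ENCODINGS = [
    "latin-1",    # ISO 8859-1
    "cp437",      # an IBM PC/DOS code page
    "mac_roman",  # a MacOS character set
    "cp500",      # an EBCDIC code page
    "utf-16-be",  # big-endian 2-byte Unicode
    "utf-16-le",  # little-endian 2-byte Unicode
    "utf-32-be",  # big-endian 4-byte ISO 10646
    "utf-8",
]


def wire_forms(text):
    """Return, per encoding, the bytes that would appear in a packet."""
    return {name: text.encode(name) for name in PACKET_ENCODINGS}


if __name__ == "__main__":
    for name, raw in wire_forms(SAMPLE_TEXT).items():
        print(f"{name:10} {raw.hex()}")
```

Every entry is a different byte sequence for the same four characters,
which is why the dissector has to know which encoding a given field uses.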

A question then would be how strings in the various non-ASCII character
sets should be stored internally in the protocol tree.

We could store them "as is".  However, this may cause problems when
displaying them, as the encodings in the font/font set/whatever used by
GTK+ in the user's locale, or the encodings used by whatever tool you're
using to view the printed output of Tethereal, or the output of "print
to file", or the encoding used by your printing software and your
printer, may not match the encoding in the packet.  It would also make
entering display filter expressions that compare strings tricky, as
you'd have to enter them in the "right" character set.
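
The "as is" problem can be sketched in a few lines of Python.  The
example assumes a packet carrying "café" in code page 437 and a user whose
locale is ISO 8859-1; both choices are illustrative only.

```python
# Sketch only: the packet encoding (cp437) and the user's locale encoding
# (ISO 8859-1) are assumptions made for the sake of the example.
packet_bytes = "café".encode("cp437")    # the bytes as seen on the wire

# Shown "as is" in an ISO 8859-1 environment, byte 0x82 is a C1 control
# code rather than e-acute, so the displayed text is wrong.
shown_as = packet_bytes.decode("latin-1")

# A filter string typed in the user's locale arrives as different bytes,
# so a byte-for-byte comparison fails although the text is the same.
typed_bytes = "café".encode("latin-1")
matches = typed_bytes == packet_bytes
```

To get a match, the user would have to type the filter string in the
packet's encoding rather than their own.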

We could store them in, say, 2-byte Unicode (big-endian, little-endian,
or whichever of those is the host's byte order), or as UTF-8.  That
would work well with GTK+ 2.0, as it will, as I remember, use UTF-8
internally; however, it'd be a bit of a mess with GTK+ 1.x (not that
GTK+ 1.x always works well with "as is", either).  It may simplify the
issues for printing to a file or printer, as you only have to worry
about the output encoding, but you still have to worry about that.
Display filter string comparisons are affected similarly: you only have
to worry about the encoding in which the filter string was entered, but
you still have to worry about that.

It also may cause problems if there are characters in the packet's
character set that aren't in 10646 - and requires that Ethereal know the
character set in the packet (which may require that it be told the
character set, e.g. for SMB clients using a code page rather than
Unicode in strings).
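
A short Python sketch of why the character set has to be known: the same
two bytes decode to entirely different text under two code pages.  The
function name and its code_page argument are hypothetical, standing in
for whatever preference or protocol field would supply the character set.

```python
def decode_oem_string(raw, code_page):
    """Decode a non-Unicode packet string.

    code_page is a hypothetical setting supplied by the user or inferred
    from the protocol; nothing in the bytes themselves identifies it.
    """
    return raw.decode(code_page, errors="replace")


raw = b"\x82\x84"
as_cp437 = decode_oem_string(raw, "cp437")   # Western European reading
as_cp866 = decode_oem_string(raw, "cp866")   # Cyrillic reading
```

Here as_cp437 is "éä" and as_cp866 is "ВД"; both are valid decodings, and
only outside knowledge says which one the sender meant.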

(It also raises the question of doing comparisons other than equality
and inequality comparisons - but that brings up the issue of dictionary
comparisons, which differ from culture to culture....)
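
As a small illustration of the ordering issue, the Python sketch below
sorts three words by code point and by their UTF-8 bytes.  The two orders
agree with each other, but neither is a dictionary order: German
dictionaries sort "Ä" with "A", while Swedish sorts it after "Z".  The
word list is made up for the example.

```python
# Sketch only: a made-up word list to show binary ordering of strings.
words = ["zebra", "apple", "Äpfel"]

# Ordering by Unicode code point.
by_code_point = sorted(words)

# Ordering by the raw UTF-8 bytes gives the same result, since UTF-8
# preserves code point order.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
```

Both lists come out as apple, zebra, Äpfel, so a less-than or greater-than
filter built on binary comparison would not match what a reader in either
culture expects.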

My inclination would be to go with UTF-8 internally, if we can figure
out how to translate from packet character sets to UTF-8, how to
translate from UTF-8 to the "display" character set when

	displaying the protocol tree in GTK+ 1.x (GTK+ 2.0 should, I
	think, not be a problem when it comes out, but it's still under
	development *and* not all systems will come with it by default
	for a while, I suspect);

	printing the protocol tree to a file or printer;

and how to translate from the character set in text widgets (and on the
command line) to UTF-8 (GTK+ 2.0 will presumably use UTF-8, but GTK+
1.x, and the command line, are a different matter).

This would obviate the need for separate FT_ types for Unicode strings
(as they'd be stored internally as UTF-8).
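
The three translations discussed above (packet to UTF-8, UTF-8 to the
display character set, and typed input to UTF-8) can be sketched in
Python as follows.  All function names are hypothetical, and escaping
unrepresentable characters on output is one possible policy chosen for
the example, not something the proposal specifies.

```python
# Sketch only: hypothetical helpers showing UTF-8 as the internal form.

def packet_to_internal(raw, packet_encoding):
    """Convert a string from the packet's character set to UTF-8."""
    return raw.decode(packet_encoding, errors="replace").encode("utf-8")


def internal_to_display(value, display_encoding):
    """Convert internal UTF-8 to the display or print character set,
    escaping characters the target cannot represent (an assumed policy)."""
    text = value.decode("utf-8")
    return text.encode(display_encoding, errors="backslashreplace")


def input_to_internal(typed, input_encoding):
    """Convert text from a text widget or the command line to UTF-8."""
    return typed.decode(input_encoding, errors="replace").encode("utf-8")


# A field read from a packet in code page 437 and a filter value typed in
# an ISO 8859-1 locale compare equal once both are UTF-8.
field = packet_to_internal(b"caf\x82", "cp437")
filter_value = input_to_internal(b"caf\xe9", "latin-1")
same = field == filter_value
```

With a single internal form, equality comparison reduces to comparing
bytes, and only the edges (dissection, display, printing, input) need to
know about other character sets.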
