Wireshark-dev: Re: [Wireshark-dev] Request for RFC regarding string handling
From: Ed Beroset <beroset@xxxxxxxxxxxxxx>
Date: Mon, 28 Oct 2013 20:03:17 -0400
Evan Huus wrote:
Does anybody have (or feel like developing) a Grand Unified Theory of
Wireshark's future string handling? Michael and Guy and I took a stab
at something like this in the comments of [2] but it's a bit
disjointed and we never really came to a consensus that I recall. Does
anyone know if the switch to Qt has any effect on this (does it make
sense to adopt QStrings everywhere, for example?)

I'll go ahead and toss another (related) log on the pile: should we be thinking about allowing for internationalization? We wouldn't necessarily need to actually provide the translations, but using the existing Qt framework to allow internationalization might be a good idea up front and may also help us work out some of the string handling.

The next time one of these issues pops up I would love to know already
how we *ought* to behave.

The difference between Wireshark and many other tools is that it's required to still "do the right thing" even with broken string encodings. Both the machine encoding and the partially-rendered human version may be required.

I don't have a Grand Unified String Theory handy, but can think of some requirements for it. One is that it may need to be able to render a number of different encodings, including the various Unicode variations, ASCII, and perhaps others such as KOI8 and even EBCDIC. Mappings will have to be sensitive to the encoded length and be able to do something reasonable even with malformed encoded strings.
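To make "do something reasonable even with malformed encoded strings" a bit more concrete, here is a minimal sketch of one possible rendering: printable ASCII passes through and every other byte is shown as a \xNN escape, so nothing in the machine form is silently dropped. The function name render_ascii_escaped and its snprintf-style return convention are my own assumptions for illustration, not anything that exists in Wireshark, and it uses plain C types rather than GLib ones so that it stands alone.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not an existing Wireshark function).
 * Renders in[0..in_len) as printable ASCII; any other byte, and the
 * backslash itself, is written as a four-character \xNN escape.
 * At most out_size-1 characters plus a terminating NUL are stored.
 * Returns the length the complete rendering needs, like snprintf,
 * so the caller can see that the rendered and encoded lengths differ.
 */
size_t render_ascii_escaped(const uint8_t *in, size_t in_len,
                            char *out, size_t out_size)
{
    size_t need = 0;

    for (size_t i = 0; i < in_len; i++) {
        char tmp[5];    /* "\xNN" plus NUL */
        int w;

        if (in[i] >= 0x20 && in[i] < 0x7F && in[i] != '\\') {
            tmp[0] = (char)in[i];
            w = 1;
        } else {
            w = snprintf(tmp, sizeof tmp, "\\x%02X", (unsigned)in[i]);
        }
        for (int j = 0; j < w; j++) {
            if (need + 1 < out_size)    /* leave room for the NUL */
                out[need] = tmp[j];
            need++;
        }
    }
    if (out_size > 0)
        out[need < out_size ? need : out_size - 1] = '\0';
    return need;
}
```

Here the encoded length and the rendered length are tracked separately on purpose: four input bytes such as 'h', 'i', 0xFF, '\n' need ten characters once rendered.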

As more of a thought experiment than a serious proposal, imagine that every protocol-based string (as contrasted with help screens or parts of the GUI) has something like the following structure:

typedef struct {
	encoding machine_form;  /* an enum of encodings */
	encoding human_form;	/* an enum of renderings */
	guint machine_len;	/* length of encoded form */
	guint human_len;	/* length of rendered form */
	guint8 **encoding_err;	/* array of pointers to
		encoding errors within machine form,
		or NULL if no errors */
	guint8 *machine;	/* pointer to encoded */
	guint8 *human;		/* pointer to rendered */
} string_s;

Is anything missing? For example, do we need to have something like "reason codes" corresponding to each encoding error? Is anything redundant?

Also, if we make the possibly rash assumption that Unicode is the superset, perhaps we can regularize the addition of new renderings by requiring conversions to and from Unicode and routines that can create an array of pointers (or maybe offsets) of encoding errors in the encoded version of the string.
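As a rough illustration of what one of those error-locating routines might look like, here is a sketch for UTF-8 only. It records offsets rather than pointers and attaches a reason code to each one. Everything named here (enc_err_reason, enc_err, utf8_find_errors) is hypothetical, and the choice to resume scanning at the first byte that did not fit the sequence is just one possible policy; it again uses plain C types instead of GLib ones so that it compiles by itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reason codes; not part of any existing API. */
typedef enum {
    ENC_ERR_BAD_LEAD,      /* byte cannot start a sequence */
    ENC_ERR_TRUNCATED,     /* buffer ends inside a sequence */
    ENC_ERR_BAD_CONT,      /* expected continuation byte is missing */
    ENC_ERR_OVERLONG,      /* code point encoded in too many bytes */
    ENC_ERR_SURROGATE,     /* U+D800..U+DFFF is not valid in UTF-8 */
    ENC_ERR_OUT_OF_RANGE   /* code point above U+10FFFF */
} enc_err_reason;

typedef struct {
    size_t offset;         /* offset of the sequence's first byte */
    enc_err_reason reason;
} enc_err;

static void record(enc_err *errs, size_t max_errs, size_t *n,
                   size_t off, enc_err_reason why)
{
    if (*n < max_errs) {
        errs[*n].offset = off;
        errs[*n].reason = why;
    }
    (*n)++;
}

/*
 * Scan buf[0..len) as UTF-8.  Stores up to max_errs errors in errs
 * (which may be NULL when max_errs is 0) and returns the total
 * number of errors found, which may exceed max_errs.
 */
size_t utf8_find_errors(const uint8_t *buf, size_t len,
                        enc_err *errs, size_t max_errs)
{
    size_t n = 0, i = 0;

    while (i < len) {
        uint8_t b = buf[i];
        size_t need, k;
        uint32_t cp, min;
        int failed = 0;

        if (b < 0x80) { i++; continue; }
        if ((b & 0xE0) == 0xC0)      { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else { record(errs, max_errs, &n, i, ENC_ERR_BAD_LEAD); i++; continue; }

        for (k = 1; k <= need; k++) {
            if (i + k >= len) {
                record(errs, max_errs, &n, i, ENC_ERR_TRUNCATED);
                failed = 1;
                break;
            }
            if ((buf[i + k] & 0xC0) != 0x80) {
                record(errs, max_errs, &n, i, ENC_ERR_BAD_CONT);
                failed = 1;
                break;
            }
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (failed) { i += k; continue; }   /* resume at the offending byte */

        if (cp < min)
            record(errs, max_errs, &n, i, ENC_ERR_OVERLONG);
        else if (cp >= 0xD800 && cp <= 0xDFFF)
            record(errs, max_errs, &n, i, ENC_ERR_SURROGATE);
        else if (cp > 0x10FFFF)
            record(errs, max_errs, &n, i, ENC_ERR_OUT_OF_RANGE);
        i += need + 1;
    }
    return n;
}
```

Returning the total count even when the caller's array is too small lets a caller first ask how many errors there are and then allocate. Whether offsets or pointers are the better representation is still an open question in this sketch.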

Perhaps a look at wide characters and locales as implemented in C++ could be useful, if only as inspiration or as a way to get some more concrete ideas on the scope of the problem.

Ed