Wireshark · Wireshark-dev: Re: [Wireshark-dev] Request for RFC regarding string handling

From: Ed Beroset <beroset@xxxxxxxxxxxxxx>

Date: Mon, 28 Oct 2013 20:03:17 -0400

Evan Huus wrote:

Does anybody have (or feel like developing) a Grand Unified Theory of
Wireshark's future string handling? Michael and Guy and I took a stab
at something like this in the comments of [2] but it's a bit
disjointed and we never really came to a consensus that I recall. Does
anyone know if the switch to Qt has any effect on this (does it make
sense to adopt QStrings everywhere, for example?)

I'll go ahead and toss another (related) log on the pile: should we be thinking about allowing for internationalization? We wouldn't necessarily need to actually provide the translations, but using the existing Qt framework to allow internationalization might be a good idea up front and may also help us work out some of the string handling.

The next time one of these issues pops up I would love to know already
how we *ought* to behave.

The difference between Wireshark and many other tools is that it's required to still "do the right thing" even with broken string encodings. Both the machine encoding and the partially-rendered human version may be required.

I don't have a Grand Unified String Theory handy, but can think of some requirements for it. One is that it may need to be able to render a number of different encodings, including the various Unicode variations, ASCII, and maybe some others such as KOI8 and maybe even EBCDIC. Mappings will have to be sensitive to the encoded length and be able to do something reasonable even with malformed encoded strings.

As more thought experiment than serious proposal, imagine that every protocol-based string (as contrasted with help screens or parts of the GUI) has something like the following structure:


typedef struct {
	encoding machine_form;  /* an enum of encodings */
	encoding human_form;	/* an enum of renderings */
	guint machine_len;	/* length of encoded form */
	guint human_len;	/* length of rendered form */
	guint8 **encoding_err;	/* array of pointers to
		encoding errors within machine form,
		or NULL if no errors */
	guint8 *machine;	/* pointer to encoded */
	guint8 *human;		/* pointer to rendered */
} string_s;

Is anything missing? For example, do we need to have something like "reason codes" corresponding to each encoding error? Is anything redundant?
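To make the "reason codes" question a bit more concrete, here is a minimal sketch of what a per-error record might look like. Everything in it is hypothetical: the names enc_err_reason, enc_err and enc_err_reason_name are invented for illustration and are not existing Wireshark API, the list of reasons is surely incomplete, and plain C types stand in for the GLib ones.

```c
#include <stddef.h>

/* Hypothetical reason codes; a real list would need discussion. */
typedef enum {
	ENC_ERR_INVALID_BYTE,	/* byte can never occur in this encoding */
	ENC_ERR_TRUNCATED,	/* multi-byte sequence cut off by end of data */
	ENC_ERR_OVERLONG,	/* non-shortest form of a code point */
	ENC_ERR_UNMAPPABLE	/* well-formed, but no equivalent when rendered */
} enc_err_reason;

/* One record per encoding error, instead of a bare pointer. */
typedef struct {
	size_t offset;		/* offset of the error within the machine form */
	enc_err_reason reason;	/* why the bytes at that offset were rejected */
} enc_err;

/* Short human-readable label for a reason code. */
static const char *enc_err_reason_name(enc_err_reason reason)
{
	switch (reason) {
	case ENC_ERR_INVALID_BYTE:	return "invalid byte";
	case ENC_ERR_TRUNCATED:		return "truncated sequence";
	case ENC_ERR_OVERLONG:		return "overlong sequence";
	case ENC_ERR_UNMAPPABLE:	return "unmappable character";
	}
	return "unknown";
}
```

Storing an offset rather than a pointer is only one option; it has the small advantage that the records stay valid if the encoded bytes are copied somewhere else.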

Also, if we make the possibly rash assumption that Unicode is the superset, perhaps we can regularize the addition of new renderings by requiring conversions to and from Unicode and routines that can create an array of pointers (or maybe offsets) of encoding errors in the encoded version of the string.
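A rough sketch of what such a per-encoding routine could look like, using ASCII as the simplest possible case. The names here (decode_result, to_unicode_fn, ascii_to_unicode) are made up for illustration and are not existing Wireshark functions. The sketch also assumes offsets rather than pointers for the error list and U+FFFD as the substitute for malformed bytes, and it uses plain C types instead of the GLib ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu	/* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical summary of one decoding pass. */
typedef struct {
	size_t n_codepoints;	/* code points written to out[] */
	size_t n_errors;	/* malformed bytes seen (may exceed max_errs) */
} decode_result;

/* Hypothetical hook each encoding would supply: decode len bytes into
 * Unicode code points. out[] must have room for len entries. Up to
 * max_errs offsets of malformed input are stored in err_offsets[],
 * which may be NULL if the caller does not want them. */
typedef decode_result (*to_unicode_fn)(const uint8_t *in, size_t len,
				       uint32_t *out,
				       size_t *err_offsets, size_t max_errs);

/* ASCII: every byte with the high bit set is treated as malformed. */
static decode_result ascii_to_unicode(const uint8_t *in, size_t len,
				      uint32_t *out,
				      size_t *err_offsets, size_t max_errs)
{
	decode_result r = { 0, 0 };

	assert(in != NULL && out != NULL);
	for (size_t i = 0; i < len; i++) {
		if (in[i] < 0x80) {
			out[r.n_codepoints++] = in[i];
		} else {
			/* Record where the bad byte was, if there is room. */
			if (err_offsets != NULL && r.n_errors < max_errs)
				err_offsets[r.n_errors] = i;
			r.n_errors++;
			/* Still emit something so the rendering stays aligned. */
			out[r.n_codepoints++] = REPLACEMENT_CHAR;
		}
	}
	return r;
}
```

Given the bytes 'O', 'K', 0xFF, '!', this yields four code points with U+FFFD in the third position and a single recorded error at offset 2, so both the original bytes and a partially rendered version remain available. A multi-byte encoding would need a more careful routine, but could present the same interface.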

Perhaps a look at wide characters and locales as implemented in C++ could be useful, if only as inspiration or as a way of getting some more concrete ideas on the scope of the problem.

Ed
