Evan Huus wrote:
Does anybody have (or feel like developing) a Grand Unified Theory of
Wireshark's future string handling? Michael and Guy and I took a stab
at something like this in the comments of [2] but it's a bit
disjointed and we never really came to a consensus that I recall. Does
anyone know if the switch to Qt has any effect on this (does it make
sense to adopt QStrings everywhere, for example?)
I'll go ahead and toss another (related) log on the pile: should we be
thinking about allowing for internationalization? We wouldn't
necessarily need to actually provide the translations, but building on
Qt's existing internationalization framework up front might be a good
idea, and it may also help us work out some of the string handling.
The next time one of these issues pops up I would love to know already
how we *ought* to behave.
The difference between Wireshark and many other tools is that it's
required to still "do the right thing" even with broken string
encodings. Both the machine encoding and the partially-rendered human
version may be required.
I don't have a Grand Unified String Theory handy, but can think of some
requirements for it. One is that it may need to be able to render a
number of different encodings, including the various Unicode encodings,
ASCII, and perhaps others such as KOI8 or even EBCDIC. Mappings will
have to be sensitive to the encoded length and be able to do something
reasonable even with malformed encoded strings.
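
To make "something reasonable" concrete, here is a minimal sketch of a
lenient UTF-8 renderer: it copies structurally valid sequences through,
substitutes U+FFFD (REPLACEMENT CHARACTER) for each malformed byte, and
records the offset of every error. The function names and plain-C types
are my own invention, not existing Wireshark API, and a full validator
would also have to reject overlong forms and surrogate code points:

    #include <stddef.h>
    #include <string.h>

    /* Length of the structurally valid UTF-8 sequence starting at p,
     * or 0 if the lead byte is invalid, a continuation byte is wrong,
     * or the sequence is truncated at the end of the field. */
    static size_t
    utf8_seq_len(const unsigned char *p, size_t avail)
    {
        size_t len, i;

        if (p[0] < 0x80)                len = 1;
        else if ((p[0] & 0xE0) == 0xC0) len = 2;
        else if ((p[0] & 0xF0) == 0xE0) len = 3;
        else if ((p[0] & 0xF8) == 0xF0) len = 4;
        else return 0;              /* stray continuation or bad lead */

        if (len > avail) return 0;  /* truncated */
        for (i = 1; i < len; i++)
            if ((p[i] & 0xC0) != 0x80) return 0;
        return len;
    }

    /* Render machine_len bytes of possibly broken UTF-8 into 'human'
     * (NUL-terminated), recording up to max_errs error offsets.  The
     * caller sizes 'human' for the worst case: 3 output bytes per
     * input byte, plus the NUL.  Returns the rendered length. */
    static size_t
    render_utf8_lenient(const unsigned char *machine, size_t machine_len,
                        char *human, size_t *err_off, size_t max_errs,
                        size_t *n_errs)
    {
        size_t in = 0, out = 0;

        *n_errs = 0;
        while (in < machine_len) {
            size_t len = utf8_seq_len(machine + in, machine_len - in);
            if (len == 0) {
                if (*n_errs < max_errs)
                    err_off[(*n_errs)++] = in;  /* where it went wrong */
                memcpy(human + out, "\xEF\xBF\xBD", 3);  /* U+FFFD */
                out += 3;
                in  += 1;           /* resynchronize one byte later */
            } else {
                memcpy(human + out, machine + in, len);
                out += len;
                in  += len;
            }
        }
        human[out] = '\0';
        return out;
    }

Fed the bytes 41 C3 28 42, this produces "A<U+FFFD>(B" with one
recorded error at offset 1; both the original bytes and the rendered
form remain available, which matches the "machine and human"
requirement above.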
As more of a thought experiment than a serious proposal, imagine that
every protocol-based string (as contrasted with help screens or parts
of the GUI) has something like the following structure:
typedef struct {
    encoding machine_form;   /* an enum of encodings */
    encoding human_form;     /* an enum of renderings */
    guint    machine_len;    /* length of encoded form */
    guint    human_len;      /* length of rendered form */
    guint8 **encoding_err;   /* array of pointers to encoding errors
                              * within machine form, or NULL if no
                              * errors */
    guint8  *machine;        /* pointer to encoded form */
    guint8  *human;          /* pointer to rendered form */
} string_s;
Is anything missing? For example, do we need to have something like
"reason codes" corresponding to each encoding error? Is anything
redundant?
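
For concreteness, here is how one instance might look for the two-byte
field C3 28 (a UTF-8 lead byte followed by an invalid continuation
byte). Everything here is illustrative: ENC_UTF_8 and REND_UTF_8 are
placeholder values for the as-yet-undefined encoding enum, glib is
assumed for guint8, and the NULL-terminated error array is just one
possible convention:

    static guint8  machine_buf[] = { 0xC3, 0x28 };
    static guint8  human_buf[]   = { 0xEF, 0xBF, 0xBD, '(', '\0' };
                                     /* U+FFFD followed by "(" */
    static guint8 *err_list[]    = { &machine_buf[0], NULL };
                                     /* one error, at the lead byte */

    static string_s example = {
        ENC_UTF_8,    /* machine_form: placeholder enum value */
        REND_UTF_8,   /* human_form: placeholder enum value   */
        2,            /* machine_len */
        4,            /* human_len, not counting the NUL      */
        err_list,     /* encoding_err */
        machine_buf,  /* machine */
        human_buf     /* human   */
    };

Written out this way, one answer to the "reason code" question suggests
itself: encoding_err could point to (or index) small structs carrying
an offset plus an error enum, rather than bare byte pointers.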
Also, if we make the possibly rash assumption that Unicode is the
superset, perhaps we can regularize the addition of new renderings by
requiring conversions to and from Unicode, plus routines that can build
an array of pointers (or maybe offsets) to the encoding errors in the
encoded version of the string.
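
A minimal sketch of what that regularization might look like, with
Unicode (as UTF-8) as the pivot; all of the names here are invented
for illustration, not proposed API:

    #include <stddef.h>
    #include <glib.h>

    /* Placeholder converter prototypes; implementations not shown. */
    size_t koi8_to_utf8(const guint8 *src, size_t src_len,
                        guint8 *dst, size_t dst_size);
    size_t utf8_to_koi8(const guint8 *src, size_t src_len,
                        guint8 *dst, size_t dst_size);
    size_t koi8_find_errors(const guint8 *src, size_t src_len,
                            size_t *offsets, size_t max_offsets);

    /* Each encoding registers a pair of converters through the
     * Unicode pivot, plus a scanner that reports the offsets of
     * malformed sequences in the encoded form. */
    typedef struct {
        const char *name;
        size_t (*to_unicode)(const guint8 *src, size_t src_len,
                             guint8 *dst, size_t dst_size);
        size_t (*from_unicode)(const guint8 *src, size_t src_len,
                               guint8 *dst, size_t dst_size);
        size_t (*find_errors)(const guint8 *src, size_t src_len,
                              size_t *offsets, size_t max_offsets);
    } encoding_handler;

    /* Adding a new rendering then reduces to one table entry. */
    static const encoding_handler handlers[] = {
        { "KOI8-R", koi8_to_utf8, utf8_to_koi8, koi8_find_errors },
        /* { "EBCDIC", ebcdic_to_utf8, utf8_to_ebcdic,
             ebcdic_find_errors }, */
    };

Reporting offsets rather than raw pointers in find_errors() would also
sidestep the lifetime questions the guint8 ** field in the struct above
raises when the machine-form buffer is reallocated or freed.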
Perhaps a look at wide characters and locales as implemented in C++
could be useful, at least for inspiration or for getting a more
concrete idea of the scope of the problem.
Ed