Wireshark-dev: Re: [Wireshark-dev] Note about proto_tree_add_unicode_string (r43379)
From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Tue, 19 Jun 2012 12:47:28 -0700
On Jun 19, 2012, at 12:01 PM, Jakub Zawadzki wrote:

> Hi,
> 
> String from tvb_get_ephemeral_string() still needs escaping with format_text(),
> cause it doesn't check encoding.
> 
> When you use:
>  tvb_get_ephemeral_string_enc(tvb, offset, length, ENC_UTF_8 | ENC_NA);
> 
> It guarantees result encoded in UTF-8:
> * string as converted from the appropriate encoding to UTF-8 ...
> 
> (Code to do it is still in XXX's but this is bug in libwireshark and no one can blame you that you used wrong function :))

Or, rather, there is no code to do it and there's an XXX comment noting that:

        case ENC_UTF_8:
                /*
                 * XXX - should map all invalid UTF-8 sequences
                 * to a "substitute" UTF-8 character.
                 */
                strbuf = tvb_get_ephemeral_string(tvb, offset, length);
                break;

*That's* why escaping is still needed - we don't handle malformed UTF-8 strings.

If we were to:

	store strings in the protocol tree as a combination of an encoding value and a raw bucket of bytes;

	in the display filter code:

		implement comparison of string fields with text strings by attempting to convert the string field value to UTF-8 at that point and having all comparisons other than != fail and have != succeed if the string can't be converted;

		also support comparison of string fields with *byte* strings, e.g. a sequence of hex-encoded octets, and have that compare the raw bucket of bytes;

	have the "construct a filter from this field" UI generate, for string fields, a comparison with a UTF-8 string if the value can be converted to UTF-8 and a comparison with a byte string if it can't;

	in the code that generates displayable text for the protocol tree, replace anything in the string that can't be mapped to UTF-8 to a "substitute" UTF-8 character, including mapping invalid UTF-8 sequences if the string's encoding is UTF-8;

	do similar stuff if a string field's value is used to generate anything in the packet summary.

that might handle this, and might also defer conversion to UTF-8 until it's absolutely necessary, which might cut CPU usage.

(A further optimization might be to, in the display filter code implement comparison of string fields with text strings by attempting to convert the *text string* to the field's encoding at that point, with the result cached so you only have to do it once, and having all comparisons other than != fail and have != succeed if the string can't be converted.)

As for PDML:

	http://www.nbee.org/doku.php?id=netpdl:visualization_expression#data_associated_to_each_netpdl_field

says

	value (always present)
		It contains the field value as an hex string. In other words, the content of the value attribute in case of a 16 bit field whose value is 0800 (hex) is the string 0800. This value is exactly the same as written in the hex file, i.e. it does not change in case of little-endian or big-endian fields. Please note that for bitmasked fields, this value corresponds to the 'unmasked' value (i.e. the real field value must be derived from a bitwise AND operation between the mask and the value).

	showvalue(always present)

		It contains the field value in a “printable” form. For instance, field value ffffffffffff (related to a MAC address) can be associated to value ffffff-ffffff and assigned to this attribute.

which seems to indicate that, for a string field, "value" should be the raw hex bytes of the string (except perhaps for byte-swapping in the case of UCS-2, UCS-4, or UTF-16), *NOT* anything that looks like the text, and suggests to me that "showvalue" would be the same as what's generated by the code that generates displayable text for the protocol tree, i.e. it'd have invalid UTF-8 sequences replaced by a "substitute" character.

However, I don't know whether that's what programs out there that read PDML would expect - and it doesn't include the encoding for the string, so that code wouldn't know what to do with the string's value.