Wireshark-dev: Re: [Wireshark-dev] FT_STRING, FT_STRINGZPAD, and null padding
From: John Thacker <johnthacker@xxxxxxxxx>
Date: Sat, 5 Sep 2020 12:51:09 -0400
On Sat, Sep 5, 2020 at 2:49 AM Guy Harris <gharris@xxxxxxxxx> wrote:
OK, here's the types of strings that there can be in protocols.

1. Counted strings.  A counted string has a count of the number of octets (it might be half the number of octets if it's UTF-16 or UCS-2, but it's really a count of octets).  There is nothing special about null characters, other than "if some software treats the string as a C string, it won't work correctly if the string happens to include a null character, so maybe that software shouldn't treat it as a C string".

2. Null-terminated, non-counted strings.  A null-terminated, non-counted, string has no count of the number of octets; it runs until a null terminator (8-bit, 16-bit, whatever) is seen.  Obviously, that string can't contain a null terminator.

3. Padded strings in a fixed-size buffer.  If the string is the length of the buffer, there is no null terminator; otherwise, there's a null terminator; the protocol may specify that all octets after the null terminator, if the string isn't the length of the buffer - 1, must be null, or may say there is no such requirement.

4. Strings that are counted *and* null-terminated.  This is redundant, but maybe somebody wants to make sure that there's a null at the end, to make life easier for implementations that use C strings internally.  This doesn't *really* make things easier, unless you're in a 100% controlled environment, because, in any environment in which the software has to deal with packets from buggy or malicious senders, and uses C strings internally, they have to make sure there's a null terminator *anyway* - they *can't* just copy the string and blithely assume that a terminating null is part of the counted blob.

In the case of a counted string, if the string is a 7-octet value of "foo\0bar", it should show up as exactly that in the packet tree item; if not, we should fix it to do so.

So this does not happen, at least not when proto_tree_add_item() is used on a FT_STRING.
Things that call tvb_format_text() will end up producing "foo\000bar" (because \0 is not one of the special cases handled, it defaults to using octal in format_text() in epan/strutil.c, which is used for non-printable but valid ASCII characters. However, proto_tree_add_item() does not do that.

I tested this by hexediting a pcap (from here: https://gitlab.com/wireshark/wireshark/-/issues/16208) to put a NUL in the middle of a counted FT_STRING, and it looked like this:

I had applied the patch to allow SSIDs to be UTF-8 in that bug, it looks like this:

Tag: SSID parameter set: admini\000tración
    Tag Number: SSID parameter set (0)
    Tag length: 15
    SSID: admini
        [Expert Info (Warning/Undecoded): Trailing stray characters]

(admini\000tración also showed up in the FT_COLUMN). In the first entry, where \000 shows up, tvb_format_text() is used explicitly and then that text is appended. In the one where it is cut off, proto_tree_add_item() is used. So that is a bug that should be fixed, then, we want to display the trailing characters as well as warn about them?

Currently, we use FT_STRING for the first of those, FT_STRINGZ for the second of those, FT_STRINGZPAD for the third of those, and FT_STRINGZ for the fourth of those; the difference between the two FT_STRINGZ cases is whether a length was specified or not.
That helps, but isn't there also FT_UINT_STRING for the common case of counted strings where the count of the number of octets appears immediately before the string? It seems like mostly a helper combining "add a FT_UINT[...] and return its value, then add a FT_STRING with that length." So ideally detect_trailing_stray_characters() should also operate on the string part of FT_UINT_STRING, and that just hasn't been added, correct?

Finally, detect_trailing_stray_characters() and tvb_format_string() also don't do the right thing for counted strings because they (or their callers) assume that the length in bytes of the string when being checked is the same as the count in octets passed in originally. That's true before get_ascii_string() or other places substitute the replacement character, which is three bytes in UTF-8, for individual bytes with the high bit set (for ASCII), or for illegal Unicode sequences that can be from 1 to 4 bytes, and thus changes the length of the string. So in the case of something like "f\xf6\xf6\0bar" (intended to be "föö\0bar" in ISO-8859-1, presumably) will be converted to "f��\0bar" which is 11 octets in length for the 7 characters, detect_trailing_stray_characters() will still be called with a length of 7 bytes (not UTF-8 characters), and thus won't notice the trailing "bar." Though at the same time there should probably be an expert info warning any time the UTF-8 replacement character is present as a hint that maybe the encoding is wrong.

We can see that be doing the same thing and *not* applying the patch to allow SSIDs to always be UTF-8. Treating it as ASCII, we see:

Tag: SSID parameter set: admini\000traci�
    Tag Number: SSID parameter set (0)
    Tag length: 15
    SSID: admini
        [Expert Info (Warning/Undecoded): Trailing stray characters]

(admini\000traci� also showed up in the FT_COLUMN). What's going on is that "\xc3\xb3" ("ó" in UTF-8) is not recognized as ASCII and is being replaced with "\xef\xbf\xbd\xef\xbf\bd", two copies of the 3 byte UTF-8 representation of UNREPR, so then some of the other functions that read the string stop after the first three replacement character bytes (reaching the length of 15 octets) and never get to the second replacement character or to the 'n."

Conversely, if I put non ASCII characters earlier in the string, I can get this nonsense:

Tag: SSID parameter set: a����i\000
    Tag Number: SSID parameter set (0)
    Tag length: 15
    SSID: a����i

Where the parts that call tvb_format_string() stop after 15 bytes are written, which is right at the first null but before the trailing characters that are in the 15 octet field, the part where proto_tree_add_item() stops after before the first NUL, and the expert info for detect_trailing_stray_characters() doesn't show up for the same reason that tvb_format_string() stops in time-- it reaches 15 octets in the substituted/validated string:

That's for a field with hexdump:

0000 61 c3 b3 c3 b3 69 00 74 72 61 63 69 c3 b3 6e a....i.t raci..n


which with the UTF-8 patch looks like a different sort of nonsense:

Tag: SSID parameter set: a����i\000
    Tag Number: SSID parameter set (0)
    Tag length: 15
    SSID: aóói
        [Expert Info (Warning/Undecoded): Trailing stray characters]

for something that looks like  aóói\0tración when interpreted as UTF-8

I do plan to file a bug or two, just wanted to check on the expected behavior.

John Thacker