Wireshark-dev: Re: [Wireshark-dev] FT_STRING, FT_STRINGZPAD, and null padding
From: John Thacker <johnthacker@xxxxxxxxx>
Date: Sat, 5 Sep 2020 14:09:54 -0400


So for the problem with calling format_text() on a string that has had replacement characters applied, I see that packet-ieee80211.c calls format_text() on a string that has already been obtained with tvb_get_string_enc() in order to get something printable.
This is pretty common, a lot of dissectors do this, because format_text() expects UTF-8, tvb_format_text() doesn't take an encoding at all and just assumes UTF-8 (and works on ASCII), and it makes sense to try to call something that does charset conversion to UTF-8 first for any field that might have an encoding other than ASCII or UTF-8.

However, tvb_get_string_enc() for ASCII strings replaces characters with the high bit set with the Unicode Replacement character, which can change the length of the string. It's not clear what length to pass into format_text() - the original length can be wrong, but strlen() doesn't work for the cases where there's a NUL in the middle of the string, as it won't get the trailing characters. A similar problem would occur if tvb_get_string_enc() actually did replacement characters for UTF-8 invalid characters (it doesn't do that now, see the comments of https://gitlab.com/wireshark/wireshark/-/issues/14948 ), and does occur with all the other encodings, where the replacement character substitution already does occur.

So I don't particularly see any easy way to handle the counted strings that might have replacement characters substituted in but also might have a NUL in the middle and we still want to display any trailing characters in the field after the NUL and warn about them.

For the dissectors that only have either ASCII or UTF-8 as the encoding, a workaround is to call tvb_format_text() instead of calling format_text() on the output of tvb_get_string_enc(). For dissectors with any other encoding, that's not an option and as a result it doesn't feel great as a solution.

To handle other encodings, I can only see:
1) Changing tvb_get_string_enc - and the encoding specific helper functions it calls - to pass back the true length as a output parameter
2) Changing tvb_get_string_enc to pass back a wmem_strbuf_t type that handles embedded nulls and has length, use that type more widely, pass it into the format function
3) Having some kind of tvb_format_text_enc() function (and variations for whitespace) that takes different encodings, and differs from tvb_get_string_enc in that it produces a printable output.

3) alone seems like a bad solution, because it would probably be called in addition to tvb_get_string_enc for the same string in the same dissector, and that's hitting some very similar tvbuff access and conversion code twice. It also would mean some more tricky and error prone conversions. It could be added as syntactic sugar for the function composition of tvb_get_string_enc() and format_text() if 1) or 2) was implemented.

John Thacker