Wireshark-bugs: [Wireshark-bugs] [Bug 10681] UTF-8 replacement characters in FT_STRINGs are esca
Comment # 9
on bug 10681
from Jeff Morriss
(In reply to Guy Harris from comment #8)
> (In reply to Jeff Morriss from comment #7)
> > Hmmm... Why isn't tvb_get_string_enc() returning valid UTF8 (like it says
> > it will)?
>
> If it says that, it lies.
True, it doesn't really say that. It says it will convert the string to UTF-8,
*possibly* mapping characters or invalid octet sequences to the Unicode
replacement character (emphasis mine).
> It calls tvb_get_utf_8_string() to extract the string, and
> tvb_get_utf_8_string() is:
[...]
> which does *no* validation of the string whatsoever.
> I seem to remember some discussion of this and some concern that doing the
> validation would slow down dissection significantly. If so, perhaps what
> needs to be done is to have the value of an FT_STRING field be a combination
> of an ENC_ value and a raw blob of bytes copied directly from the packet,
> with the blob converted to valid UTF-8 when necessary - with that
> conversion, for ENC_UTF_8, getting rid of invalid UTF-8 sequences.
When (at what point) would it be necessary?
Currently we're asserting out before adding it to the tree, presumably to
protect the display (though that assertion came in r53827 without an
explanation as to why).
Could we delay until it's displayed?
If not, are there enough cases where we extract/encode a UTF-8 string *without*
adding it to the tree to warrant delaying the validation+conversion until
proto_tree_add_string()?
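
To make the "raw blob now, valid UTF-8 only when necessary" idea from comment #8
concrete, here is a minimal standalone sketch in plain C. It is not Wireshark
code: lazy_string, lazy_string_new(), lazy_string_utf8() and lazy_string_free()
are hypothetical names invented for illustration, and it assumes the bytes are
meant to be UTF-8 (no ENC_ value is stored). The copy is cheap, and the
validation cost is only paid the first time someone asks for a displayable
string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type (not a Wireshark structure): raw packet bytes plus a
 * lazily built, sanitized UTF-8 copy. */
typedef struct {
    unsigned char *raw;  /* bytes copied straight from the packet */
    size_t raw_len;
    char *utf8;          /* NULL until valid UTF-8 is first requested */
} lazy_string;

/* Length of the well-formed UTF-8 sequence starting at p (n >= 1 bytes
 * available), or 0 if the bytes at p do not start a well-formed sequence. */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    unsigned char c = p[0];
    size_t len, i;

    if (c < 0x80) return 1;
    else if (c >= 0xC2 && c <= 0xDF) len = 2;
    else if (c >= 0xE0 && c <= 0xEF) len = 3;
    else if (c >= 0xF0 && c <= 0xF4) len = 4;
    else return 0;  /* stray continuation byte, C0/C1, or F5..FF */

    if (len > n) return 0;  /* truncated sequence */
    for (i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80) return 0;

    /* Reject overlong forms, surrogates, and code points above U+10FFFF. */
    if (c == 0xE0 && p[1] < 0xA0) return 0;
    if (c == 0xED && p[1] > 0x9F) return 0;
    if (c == 0xF0 && p[1] < 0x90) return 0;
    if (c == 0xF4 && p[1] > 0x8F) return 0;
    return len;
}

/* Cheap path: copy the bytes, do no validation at all. */
lazy_string *lazy_string_new(const void *bytes, size_t len)
{
    lazy_string *s;

    assert(bytes != NULL || len == 0);
    s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->raw = malloc(len ? len : 1);
    if (s->raw == NULL) { free(s); return NULL; }
    if (len) memcpy(s->raw, bytes, len);
    s->raw_len = len;
    s->utf8 = NULL;
    return s;
}

/* Expensive path, run at most once: replace every byte that does not start
 * a well-formed sequence with U+FFFD. Embedded NUL bytes are copied as-is. */
const char *lazy_string_utf8(lazy_string *s)
{
    size_t in = 0, out = 0;

    if (s->utf8 != NULL) return s->utf8;  /* already converted */

    /* Worst case: every input byte becomes a 3-byte U+FFFD. */
    s->utf8 = malloc(s->raw_len * 3 + 1);
    if (s->utf8 == NULL) return NULL;

    while (in < s->raw_len) {
        size_t len = utf8_seq_len(s->raw + in, s->raw_len - in);
        if (len == 0) {
            memcpy(s->utf8 + out, "\xEF\xBF\xBD", 3);  /* U+FFFD */
            out += 3;
            in += 1;
        } else {
            memcpy(s->utf8 + out, s->raw + in, len);
            out += len;
            in += len;
        }
    }
    s->utf8[out] = '\0';
    return s->utf8;
}

void lazy_string_free(lazy_string *s)
{
    if (s == NULL) return;
    free(s->utf8);
    free(s->raw);
    free(s);
}
```

In this sketch a dissector that only extracts the string never pays for
validation; only a caller that needs display-safe text (the tree/display code,
in the terms of this bug) would call lazy_string_utf8(). Whether that actually
saves enough to matter is exactly the open question above.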