Wireshark-bugs: [Wireshark-bugs] [Bug 10681] UTF-8 replacement characters in FT_STRINGs are esca

Date: Thu, 14 Apr 2016 13:55:30 +0000

Comment # 9 on bug 10681 from Jeff Morriss

(In reply to Guy Harris from comment #8)
> (In reply to Jeff Morriss from comment #7)
> > Hmmm...  Why isn't tvb_get_string_enc() returning valid UTF8 (like it says
> > it will)?
> 
> If it says that, it lies.

True, it doesn't really say that. It says it will convert the string to UTF-8,
*possibly* mapping characters or invalid octet sequences to the Unicode
replacement character (emphasis mine).
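
For illustration, here is a minimal sketch of what "mapping invalid octet
sequences to the Unicode replacement character" could look like. This is not
the Wireshark implementation; utf8_seq_len() and utf8_sanitize() are
hypothetical names made up for this example, and the sketch assumes one
U+FFFD per rejected byte, which is only one of several reasonable policies.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the well-formed UTF-8 sequence starting at
 * p (n >= 1 bytes available), or 0 if the bytes there are not well-formed.
 * Rejects overlong forms, surrogates and code points above U+10FFFF. */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    unsigned char c = p[0];
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the 2nd byte */
    size_t len, i;

    if (c < 0x80) {
        return 1;                       /* plain ASCII */
    } else if (c >= 0xC2 && c <= 0xDF) {
        len = 2;
    } else if (c >= 0xE0 && c <= 0xEF) {
        len = 3;
        if (c == 0xE0) lo = 0xA0;       /* exclude overlong 3-byte forms */
        if (c == 0xED) hi = 0x9F;       /* exclude surrogates */
    } else if (c >= 0xF0 && c <= 0xF4) {
        len = 4;
        if (c == 0xF0) lo = 0x90;       /* exclude overlong 4-byte forms */
        if (c == 0xF4) hi = 0x8F;       /* exclude values above U+10FFFF */
    } else {
        return 0;                       /* stray continuation or bad lead */
    }

    if (n < len) return 0;              /* truncated sequence */
    if (p[1] < lo || p[1] > hi) return 0;
    for (i = 2; i < len; i++)
        if (p[i] < 0x80 || p[i] > 0xBF) return 0;
    return len;
}

/* Hypothetical helper: copy n bytes into a newly allocated NUL-terminated
 * string, replacing every byte that does not start a well-formed sequence
 * with U+FFFD (EF BF BD). Returns NULL on allocation failure; the caller
 * frees the result. */
char *utf8_sanitize(const unsigned char *in, size_t n)
{
    char *out = malloc(n * 3 + 1);      /* worst case: every byte replaced */
    size_t i = 0, o = 0;

    if (out == NULL) return NULL;
    while (i < n) {
        size_t len = utf8_seq_len(in + i, n - i);
        if (len == 0) {
            memcpy(out + o, "\xEF\xBF\xBD", 3);
            o += 3;
            i += 1;
        } else {
            memcpy(out + o, in + i, len);
            o += len;
            i += len;
        }
    }
    out[o] = '\0';
    return out;
}
```

With this sketch, the three bytes 'a', 0xFF, 'z' would come back as 'a',
U+FFFD, 'z', while an already valid string is copied unchanged. The extra
pass over every string is the cost that the performance concern is about.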

> It calls tvb_get_utf_8_string() to extract the string, and
> tvb_get_utf_8_string() is:

[...]

> which does *no* validation of the string whatsoever.

> I seem to remember some discussion of this and some concern that doing the
> validation would slow down dissection significantly.  If so, perhaps what
> needs to be done is to have the value of an FT_STRING field be a combination
> of an ENC_ value and a raw blob of bytes copied directly from the packet,
> with the blob converted to valid UTF-8 when necessary - with that
> conversion, for ENC_UTF_8, getting rid of invalid UTF-8 sequences.

When (at what point) would it be necessary?

Currently we're asserting out before adding it to the tree--presumably to
protect the display (though that assertion came in r53827 without an
explanation as to why).

Could we delay until it's displayed?

If not, are there enough cases where we extract/encode a UTF8 string *without*
adding it to the tree to warrant delaying the validation+conversion until
proto_tree_add_string()?
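
To make the "delay until it's needed" idea concrete, here is a rough sketch
of a field value that keeps the raw bytes and only builds a displayable
string on first request. Everything in it (lazy_string and its functions) is
hypothetical and not an existing Wireshark type or API. To keep it short, the
conversion step assumes 7-bit ASCII input and maps any byte with the high bit
set to U+FFFD; an ENC_UTF_8 conversion would validate multi-byte sequences
at that point instead.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical field value: raw packet bytes plus a lazily built string. */
typedef struct {
    unsigned char *raw;    /* bytes copied verbatim from the packet */
    size_t raw_len;
    char *display;         /* valid UTF-8; NULL until first requested */
    unsigned conversions;  /* how many times the conversion actually ran */
} lazy_string;

/* Copy the bytes and defer all validation. Returns 0 on success, -1 on
 * allocation failure. */
int lazy_string_init(lazy_string *s, const unsigned char *bytes, size_t len)
{
    s->raw = malloc(len ? len : 1);
    if (s->raw == NULL) return -1;
    if (len > 0) memcpy(s->raw, bytes, len);
    s->raw_len = len;
    s->display = NULL;
    s->conversions = 0;
    return 0;
}

/* Build the displayable form on first use and cache it afterwards.
 * Assumption for this sketch: ASCII input, high-bit bytes become U+FFFD. */
const char *lazy_string_display(lazy_string *s)
{
    size_t i, o = 0;

    if (s->display != NULL) return s->display;   /* already converted */

    s->display = malloc(s->raw_len * 3 + 1);     /* worst case: all replaced */
    if (s->display == NULL) return NULL;
    for (i = 0; i < s->raw_len; i++) {
        if (s->raw[i] < 0x80) {
            s->display[o++] = (char)s->raw[i];
        } else {
            memcpy(s->display + o, "\xEF\xBF\xBD", 3);
            o += 3;
        }
    }
    s->display[o] = '\0';
    s->conversions++;
    return s->display;
}

/* Release both the raw copy and the cached display string. */
void lazy_string_free(lazy_string *s)
{
    free(s->raw);
    free(s->display);
    s->raw = NULL;
    s->display = NULL;
    s->raw_len = 0;
}
```

In this arrangement a string that is extracted but never displayed never
pays for the conversion, and one that is displayed repeatedly pays once. The
trade-off is that every consumer of the value has to go through the accessor
rather than reading the bytes directly.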
