Wireshark-bugs: [Wireshark-bugs] [Bug 10681] UTF-8 replacement characters in FT_STRINGs are esca
Date: Thu, 14 Apr 2016 02:25:24 +0000

Comment # 7 on bug 10681 from
(In reply to Guy Harris from comment #6)
> (In reply to Jeff Morriss from comment #1)
> > (is HTTP supposed to be ASCII or UTF8?).
> 
> At least as I read RFC 7230 and RFC 3986, a request-target can be UTF-8.

I agree, but RFC 3986 and especially RFC 3987 appear (to me) to say that UTF-8
must be percent-encoded (which means that while there may be UTF-8, each byte
of it is going to look to us like the ASCII percent character followed by two
ASCII hex digits).  The Wikipedia IRI page summarizes this in the fewest words:

https://en.wikipedia.org/wiki/Internationalized_resource_identifier
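
To make the percent-encoding point concrete, here is a minimal sketch in plain
C.  The helper name percent_encode_non_ascii() is hypothetical (it is not a
Wireshark or GLib function), and it only escapes bytes >= 0x80; a full RFC 3986
encoder would also escape reserved ASCII characters and '%' itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: copy `in` to `out`,
 * replacing every byte >= 0x80 with "%XX" (uppercase hex) and leaving
 * ASCII bytes untouched.
 *
 * Returns the number of characters written (excluding the trailing NUL),
 * or -1 if `out` is too small.
 */
static int
percent_encode_non_ascii(const char *in, char *out, size_t out_size)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    assert(in != NULL && out != NULL);

    for (const unsigned char *p = (const unsigned char *)in; *p != '\0'; p++) {
        if (*p < 0x80) {
            /* ASCII byte: copy as-is, keeping room for the NUL. */
            if (o + 1 >= out_size)
                return -1;
            out[o++] = (char)*p;
        } else {
            /* Non-ASCII byte: emit '%' plus two hex digits. */
            if (o + 3 >= out_size)
                return -1;
            out[o++] = '%';
            out[o++] = hex[*p >> 4];
            out[o++] = hex[*p & 0x0F];
        }
    }

    if (o >= out_size)
        return -1;
    out[o] = '\0';
    return (int)o;
}
```

With this sketch, the UTF-8 bytes 0xC3 0xA9 (U+00E9) in "/caf\xC3\xA9" come
out as "/caf%C3%A9", which is pure ASCII on the wire.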

That being said, would there be any harm in Wireshark accepting UTF-8 in a URI?

Actually, there is: changing that tvb_get_string_enc() call to use ENC_UTF_8
results in assertions:

22:20:40          Warn Dissector bug, protocol HTTP, in packet 1:
../../epan/proto.c:3476: failed assertion "g_utf8_validate(value, -1, ((void *)0))"

Hmmm...  Why isn't tvb_get_string_enc() returning valid UTF-8 (like it says it
will)?
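
For reference, here is a rough sketch of the kind of byte sequences a UTF-8
validity check rejects.  This is a hypothetical stand-in written in plain C,
not GLib's g_utf8_validate() (whose exact rules, e.g. for embedded NULs, may
differ), and it says nothing about what tvb_get_string_enc() actually returned
in the failing case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical validator, for illustration only.  Returns true if the
 * `len` bytes at `s` are well-formed UTF-8: no stray or missing
 * continuation bytes, no overlong forms, no surrogates, and nothing
 * above U+10FFFF.
 */
static bool
is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* code point and smallest legal value */

        if (c < 0x80) {         /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;       /* stray continuation or invalid lead byte */
        }

        if (len - i - 1 < need)
            return false;       /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return false;       /* overlong encoding */
        if (cp > 0x10FFFF)
            return false;       /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;       /* UTF-16 surrogate */

        i += need + 1;
    }
    return true;
}
```

Under this sketch, "caf\xC3\xA9" (UTF-8) and "\xEF\xBF\xBD" (U+FFFD, the
replacement character) pass, while "caf\xE9" (Latin-1) and the overlong
"\xC0\xAF" fail.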


Anyway, the file_data field will, according to a comment in change 13275,
generally contain HTML or JavaScript; the former certainly can be UTF-8.  So
presumably that string should be extracted using ENC_UTF_8--if we can fix this
assertion problem.

