Wireshark-bugs: [Wireshark-bugs] [Bug 10681] UTF-8 replacement characters in FT_STRINGs are esca
Comment # 7
on bug 10681
from Jeff Morriss
(In reply to Guy Harris from comment #6)
> (In reply to Jeff Morriss from comment #1)
> > (is HTTP supposed to be ASCII or UTF8?).
>
> At least as I read RFC 7230 and RFC 3986, a request-target can be UTF-8.
I agree, but RFC 3986 and especially RFC 3987 appear (to me) to say that UTF-8
must be percent-encoded (which means that, while there may be UTF-8, it's going
to look to us like the ASCII percent character followed by two ASCII hex digits
for each byte). The Wikipedia IRI page summarizes this in the fewest words:
https://en.wikipedia.org/wiki/Internationalized_resource_identifier
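To illustrate what that percent-encoding looks like on the wire, here is a
minimal sketch in plain C. The names percent_encode() and is_unreserved() are
hypothetical helpers written for this example, not Wireshark or GLib functions,
and the sketch encodes every byte outside the RFC 3986 "unreserved" set (so it
would also encode '/', which suits a single path segment rather than a whole
URI).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true for RFC 3986 "unreserved" characters
 * (ALPHA / DIGIT / "-" / "." / "_" / "~"). */
int is_unreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

/* Hypothetical helper: percent-encode the NUL-terminated string `in` into
 * `out`. Returns the number of bytes written (excluding the terminating
 * NUL), or (size_t)-1 if `out_size` is too small. */
size_t percent_encode(const char *in, char *out, size_t out_size)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    for (const unsigned char *p = (const unsigned char *)in; *p != '\0'; p++) {
        size_t need = is_unreserved(*p) ? 1 : 3;

        /* Always leave room for the terminating NUL. */
        if (n + need + 1 > out_size)
            return (size_t)-1;

        if (need == 1) {
            out[n++] = (char)*p;
        } else {
            out[n++] = '%';
            out[n++] = hex[*p >> 4];
            out[n++] = hex[*p & 0x0F];
        }
    }

    if (n + 1 > out_size)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

With this sketch, the UTF-8 bytes 0xC3 0xA9 (U+00E9) come out as the six ASCII
characters "%C3%A9", which is all a dissector would see in a conforming URI.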
That being said, would there be any harm in Wireshark accepting UTF-8 in a URI?
Actually, there is: changing that tvb_get_string_enc() call to use ENC_UTF_8
results in assertions:
22:20:40 Warn Dissector bug, protocol HTTP, in packet 1:
../../epan/proto.c:3476: failed assertion "g_utf8_validate(value, -1, ((void *)0))"
Hmmm... Why isn't tvb_get_string_enc() returning valid UTF-8 (as it says it
will)?
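For reference, the kind of guarantee that assertion expects could be met by
replacing every invalid byte with U+FFFD (the REPLACEMENT CHARACTER) before the
string is handed on. The following is only a sketch in plain C under that
assumption; utf8_seq_len() and utf8_sanitize() are hypothetical names made up
for this example and are not what tvb_get_string_enc() actually does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence
 * starting at s, or 0 if the bytes there are invalid or truncated. */
size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    size_t len;
    unsigned long cp, min;

    if (avail == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return 0;               /* stray continuation byte or 0xF8..0xFF */

    if (avail < len)
        return 0;               /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min)
        return 0;               /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* UTF-16 surrogate */
    if (cp > 0x10FFFF)
        return 0;               /* beyond the Unicode range */
    return len;
}

/* Hypothetical helper: copy `in` to `out`, replacing each invalid byte with
 * U+FFFD (EF BF BD). Returns bytes written (excluding the terminating NUL),
 * or (size_t)-1 if `out_size` is too small. */
size_t utf8_sanitize(const unsigned char *in, size_t in_len,
                     char *out, size_t out_size)
{
    static const char replacement[] = "\xEF\xBF\xBD";
    size_t i = 0, n = 0;

    while (i < in_len) {
        size_t len = utf8_seq_len(in + i, in_len - i);
        const void *src = len ? (const void *)(in + i)
                              : (const void *)replacement;
        size_t copy = len ? len : 3;

        if (n + copy + 1 > out_size)
            return (size_t)-1;
        memcpy(out + n, src, copy);
        n += copy;
        i += len ? len : 1;     /* skip one bad byte and resynchronize */
    }

    if (n + 1 > out_size)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

The output buffer needs at most 3 * in_len + 1 bytes, since each invalid input
byte expands to the three-byte encoding of U+FFFD.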
Anyway, the file_data field will, according to a comment in change 13275,
generally contain HTML or JavaScript; the former certainly can be UTF-8. So
presumably that string should be extracted using ENC_UTF_8--if we can fix this
assertion problem.