Wireshark-dev: Re: [Wireshark-dev] No tvb_get for string-encoded numbers?
From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Sat, 5 Apr 2014 02:52:16 -0700
On Apr 4, 2014, at 2:01 PM, Hadriel Kaplan <hadriel.kaplan@xxxxxxxxxx> wrote:

> For protocols which are actually truly UTF-8, I'm planning to just assume treating them as ASCII is ok, because as far as I know the atoi/strtol/etc. functions don't actually care: if they see the ASCII characters for digits (and +/-/etc.) they'll parse it, else not. So any non-ASCII UTF-8 character in the sequence is meaningless to them and they stop parsing at that character.

Yes, the only valid octets in a number in any "extended ASCII" would be:

	0x2b, 0x2d, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, and 0x37 ('+', '-', and '0' through '7');

	0x38 and 0x39 ('8' and '9') if the radix is 10 or 16;

	0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x61, 0x62, 0x63, 0x64, 0x65, and 0x66 ('A' through 'F' and 'a' through 'f') if the radix is 16;

so anything with the 8th bit set is not valid. That means the same routine can handle ASCII, ISO 8859-n, various Windows code pages, various Mac code pages, and UTF-8: the actual character encoding is irrelevant, as long as ASCII characters are encoded as a single octet having the ASCII code point value.
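
To make that concrete, here is a rough sketch of the octet test described above. The function names is_number_octet() and number_prefix_len() are made up for illustration and are not existing Wireshark routines; the sketch also assumes the radix is one of 8, 10, or 16.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not a Wireshark API: reports whether one octet can
 * appear in a textual number of the given radix (assumed to be 8, 10, or 16). */
static bool
is_number_octet(unsigned char c, int radix)
{
    if (c & 0x80)
        return false;                       /* 8th bit set: never valid */
    if (c == 0x2b || c == 0x2d)
        return true;                        /* '+' and '-' */
    if (c >= 0x30 && c <= 0x37)
        return true;                        /* '0' through '7' */
    if (c == 0x38 || c == 0x39)
        return radix == 10 || radix == 16;  /* '8' and '9' */
    if ((c >= 0x41 && c <= 0x46) || (c >= 0x61 && c <= 0x66))
        return radix == 16;                 /* 'A'-'F' and 'a'-'f' */
    return false;
}

/* Hypothetical helper: length of the leading run of octets that could be
 * part of a number. It never looks at the character encoding; it only
 * classifies octets, and it does not check that signs appear only at the
 * start of the run. */
static size_t
number_prefix_len(const unsigned char *buf, size_t len, int radix)
{
    size_t i = 0;

    while (i < len && is_number_octet(buf[i], radix))
        i++;
    return i;
}
```

For the octets 31 32 c3 a9 ("12" followed by a UTF-8 e-acute) and for 31 32 e9 (the same text in ISO 8859-1), number_prefix_len() with radix 10 returns 2 in both cases, because every octet of the non-ASCII character has the 8th bit set.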