Wireshark-dev: Re: [Wireshark-dev] No tvb_get for string-encoded numbers?
From: Hadriel Kaplan <hadriel.kaplan@xxxxxxxxxx>
Date: Fri, 4 Apr 2014 17:01:50 -0400
On Apr 4, 2014, at 4:04 PM, Guy Harris <guy@xxxxxxxxxxxx> wrote:

> On Apr 4, 2014, at 7:30 AM, Hadriel Kaplan <hadriel.kaplan@xxxxxxxxxx> wrote:
> 
>> I might be overlooking something, but I don’t see a tvb_get_* function to get a uint8/16/32/64 that was encoded as a ascii or utf-8 string in the packet. Is there such a thing?
> 
> No.
> I've occasionally also thought there should be such a routine.

I've started coding it today but my real day-job is getting in the way. :)


> Note, though, that, whilst tvb_get_guint8() and tvb_get_{n,le}tohXXX() can never fail, because every possible sequence of octets is a valid 2's complement integral value, routines to get a number encoded as a string *can* fail, e.g. 0123xyzw is not a valid number in bases 8, 10, or 16.

I'm thinking the tvb_get_* routines should just execute strtol() internally (and the other flavors depending on number type)... and thus return 0 if it couldn't be read. And it would internally handle out-of-range errors, but still return 0.

But I'm also thinking the tvb_get_* would take the same optional char** endptr param that strtol() takes, so it can pass back the ending offset - that way you could catch things like "0123xyzw", because if I recall correctly it actually *is* successfully converted to a number: the long int 123 for the "0123" portion. (though I could be mis-remembering)

Fwiw, looking at some of the consumers of such a routine, they don't seem overly concerned with errors currently when they call atoi/strtol/etc.

Also, I'm only planning to do this for base 10. A proto needing something else can do it the long way.


> There are other cases where a tvb_get_ routine can return "you lose", e.g. tvb_get_string_enc() can fail if there are invalid octet sequences (about the only encodings I know of where *every* octet sequence is a valid string are some of the ISO 8859-n encodings), and at least some floating-point formats probably have invalid values (I guess an IEEE NaN is "valid", at least to the extent that if we try to format it it'll show up as "NaN", but if we try to do calculations with it we might get a floating-point exception.

As an aside, I'm *only* thinking of having this for ASCII strings; anything else needs to do it the long way.

For protocols which are actually truly UTF-8, I'm planning to just assume treating them as ASCII is ok, because as far as I know the atoi/strtol/etc. functions don't actually care: if they see the ASCII characters for digits (and +/-/etc.) they'll parse it, else not. So any non-ASCII UTF-8 character in the sequence is meaningless to them and they stop parsing at that character.

And since the current potential consumers of these routines only call atoi/strtol/etc. right now, they're really only doing the conversion for ASCII anyway, afaik.  But I'll test it to verify, in case I'm wrong.

So what that means is I don't plan to have an encoding ENC_* param for the tvb_get_* routines, nor need the tvb_get_* routines to internally call tvb_get_string_enc(). In fact I'm hoping to do without any temp string being created, even inside the tvb_get_* functions.


> And I'd like to see proto_tree_add_XXX_item() routines that add an item with a particular type *and* take a pointer argument and return the value for the item through that pointer;

Good idea.


>> And if we had common functions handle ascii and utf-8 string-encoded numbers, they could avoid creating temporary strings as well.
> 
> The only real encoding issues are "ASCII superset" (so that "0123456789", for example, is encoded the same as in ASCII) vs. "2 or more bytes per ASCII character" (e.g., UCS-2, UTF-16, and UCS-4) vs. "one of those 7-bit GSM character encodings" vs. "EBCDIC".

See above.  The number of proto's that need such a thing for anything but ASCII and UTF-8 is pretty small I think. And it appears the ones that are truly UTF-8 do it for ASCII only anyway.

-hadriel