Wireshark-dev: Re: [Wireshark-dev] Insufficient Data for Heuristic
From: Evan Huus <eapache@xxxxxxxxx>
Date: Sat, 22 Feb 2014 20:21:41 -0500
On Sat, Feb 22, 2014 at 7:46 PM, Guy Harris <guy@xxxxxxxxxxxx> wrote:
>
> On Feb 22, 2014, at 4:13 PM, Evan Huus <eapache@xxxxxxxxx> wrote:
>
>> If a dissector checks the captured length and finds that it doesn't
>> have enough data captured to run its heuristic (assuming there was
>> enough on the wire for the packet to be valid), should that count as
>> an auto-pass, or an auto-fail (i.e. should the heuristic reject the
>> packet, or assume that it's valid and skip the check)?
>>
>> My instinct is to count it as a pass; we'll dissect the first few
>> fields then throw an exception. I suppose there are potentially other
>> dissectors in line that would actually accept the packet, but then
>> there might also be cases where there aren't any, and we'd be leaving
>> it undissected.
>
> "Leaving it undissected" is independent of the order in which the dissectors' register-handoff routines are run; "letting the first one dissect it" isn't independent of that order.

Good point.

> Perhaps it's time to split the "check if this is a packet for this protocol" and "dissect this packet" operations into separate functions.  With that, for any given protocol with zero or more key-based dissector tables and a heuristic dissector table, you would have dissectors that are registered in one of the key-based dissector tables, if there are any, and dissectors that are registered in the heuristic dissector table.  The only difference between the two tables would be that entries in the key-based tables have a key (port number, protocol number, media type, etc.) and entries in the heuristic-based tables don't.

So register_dissector would take two function pointers - one for the
dissection and one for the heuristic? Calling a dissector would
*always* consist of making sure the heuristic (if any) returned true
before dissecting?

Sounds like a neat idea, but a lot of work and possibly expensive to
run that many heuristics.
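
To make sure I understand the proposal, here is a rough sketch of the
two-pointer registration. Every name in it (proto_entry, check_fn,
dissect_fn, call_entry, the toy_* functions) is made up for
illustration and is not existing Wireshark API; the buffers are plain
byte arrays rather than tvbuffs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not the Wireshark API. */
typedef bool (*check_fn)(const uint8_t *data, size_t captured_len);
typedef int  (*dissect_fn)(const uint8_t *data, size_t captured_len);

typedef struct {
    const char *name;
    check_fn    check;    /* may be NULL: treated as "always yes" */
    dissect_fn  dissect;  /* returns the number of bytes consumed */
} proto_entry;

/* Run the check (if any) before dissecting.
 * Returns bytes consumed, or 0 if the check rejected the packet. */
static int call_entry(const proto_entry *e, const uint8_t *data, size_t len)
{
    if (e->check != NULL && !e->check(data, len))
        return 0;
    return e->dissect(data, len);
}

/* Toy protocol: the first byte must be the made-up magic value 0x7E. */
static bool toy_check(const uint8_t *data, size_t len)
{
    return len >= 1 && data[0] == 0x7E;
}

/* Toy dissector: pretends to consume everything that was captured. */
static int toy_dissect(const uint8_t *data, size_t len)
{
    (void)data;
    return (int)len;
}
```

With this shape, an entry in a key-based table can leave check as NULL,
while a heuristic table would refuse to register such an entry.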

> If there's one or more entries in a key-based dissector table matching a given key, the "check if this is a packet for the protocol" routine would be run for each of them; if there is no such routine for an entry, we'd treat that as a routine that always says "yes".  If only one routine matches, we'd call the corresponding "dissect this packet" routine; if more than one matches, or if none matches, we'd dissect it as data.

That's another tangential question. Is it better to guess and (maybe)
be wrong, or to just display as raw data and let the user specify what
it is?

The statistics nerd in me wants to start writing a Bayesian decode-as
predictor that would learn the types of captures you look at and guess
what protocols were present based on that, but that's never gonna
happen.

> If there's one or more heuristic dissectors in a heuristic dissector table, the "check if this is a packet for the protocol" routine would be run for each of them.  (We would reject attempts to register a null "check if this is a packet for the protocol" routine in a heuristic dissector table.)  If only one routine matches, we'd call the corresponding "dissect this packet" routine; if more than one matches, or if none matches, we'd dissect it as data.
>
> In the cases where there's more than one, we'd note the protocols for them, and, in the "Dissect As..." dialog, present those protocols.  If a protocol is selected, we'd somehow mark its entry as "always use this entry", so that the above searches for a dissector to hand off to are skipped.
>
> In this case, if we count "not enough data" as an auto-pass, we'd end up punting the choice of dissector to the user if more than one matched.
>
> A variant would be to have a "strong pass" (enough data to check, and the check passed) and a "weak pass" (not enough data to check), prefer strong passes to weak passes, choose the strong pass if there's only one, and punt to the user if there are no strong passes but there's at least one weak pass or if there's more than one strong pass (and possibly sort the strong passes before the weak ones).

Or, alternatively, have each heuristic return an integer score, and
simply choose the dissector whose heuristic returns the highest score.
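
A minimal sketch of what score-based selection could look like,
folding the strong/weak pass idea into the score values. The names
(score_fn, SCORE_*, pick_best, magic_score, flag_score) and the
specific score values are assumptions for illustration, not existing
Wireshark API. A tie at the top, or no positive score at all, returns
-1, meaning "dissect as data and let the user choose".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: each heuristic returns a score; 0 means "reject". */
typedef int (*score_fn)(const uint8_t *data, size_t captured_len);

enum { SCORE_REJECT = 0, SCORE_WEAK = 1, SCORE_STRONG = 10 };

/* Returns the index of the unique highest-scoring heuristic, or -1 if
 * nothing scored above SCORE_REJECT or the top score is tied. */
static int pick_best(const score_fn *h, size_t n,
                     const uint8_t *data, size_t len)
{
    int best = -1;
    int best_score = SCORE_REJECT;
    int tied = 0;

    for (size_t i = 0; i < n; i++) {
        int s = h[i](data, len);
        if (s > best_score) {
            best = (int)i;
            best_score = s;
            tied = 0;
        } else if (s == best_score && s > SCORE_REJECT) {
            tied = 1;
        }
    }
    return tied ? -1 : best;
}

/* Toy heuristic: needs 2 captured bytes to check a made-up magic. */
static int magic_score(const uint8_t *d, size_t len)
{
    if (len < 2)
        return SCORE_WEAK;  /* not enough data to check */
    return (d[0] == 0xCA && d[1] == 0xFE) ? SCORE_STRONG : SCORE_REJECT;
}

/* Toy heuristic: needs 4 captured bytes to check a made-up flag. */
static int flag_score(const uint8_t *d, size_t len)
{
    if (len < 4)
        return SCORE_WEAK;  /* not enough data to check */
    return (d[3] == 0x01) ? SCORE_STRONG : SCORE_REJECT;
}
```

With only three captured bytes, magic_score can still give a strong
pass while flag_score gives a weak one, so the strong pass wins without
bothering the user.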

Lots of interesting questions here, but all of them require
non-trivial work. Given the context of the review I was hoping for an
interim decision as to what we recommend given the current API?
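
For reference, the choice in question boils down to the one marked
line in a heuristic shaped like the sketch below. This is a standalone
illustration with made-up names and field checks (toy_heuristic,
TOY_HEADER_LEN, truncated_policy), taking raw lengths as parameters
instead of a tvbuff; it is not code from the review.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimum number of bytes the heuristic wants to inspect. */
#define TOY_HEADER_LEN 8

typedef enum { ON_TRUNCATED_REJECT, ON_TRUNCATED_ACCEPT } truncated_policy;

static bool toy_heuristic(const uint8_t *data, size_t captured_len,
                          size_t reported_len, truncated_policy policy)
{
    /* Too short on the wire: cannot be a valid packet either way. */
    if (reported_len < TOY_HEADER_LEN)
        return false;

    /* Long enough on the wire, but the capture was cut short.
     * This is the auto-pass versus auto-fail decision. */
    if (captured_len < TOY_HEADER_LEN)
        return policy == ON_TRUNCATED_ACCEPT;

    /* Enough data captured: run the (made-up) field checks. */
    return data[0] == 0x01 && data[7] == 0xFF;
}
```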