Wireshark-dev: Re: [Wireshark-dev] RFD: New language to write dissectors
From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Sat, 14 Jul 2012 15:31:06 -0700
On Jul 14, 2012, at 8:26 AM, Jakub Zawadzki wrote:

> It'd be great if we have some abstract and pure (no C/assembly inline) language to write dissectors.

Or "to describe protocols and the way packets for those protocols are displayed" - the languages in question wouldn't be as procedural as C/Lua/etc.; they'd be more descriptive.

> We could invent yet another protocol description language,

...but, as you suggest, we probably shouldn't.

> but I was thinking to base grammar on netmon NPL [1] or wsgd [2].

Those are probably the two best choices.

I'm not sure it has to be a choice, though - we could implement both, resources permitting, of course.  (And, of course, given that there are many already-existing languages that describe protocols - ASN.1, {OSF IDL/MIDL/PIDL} for DCE RPC, rpcgen for ONC RPC, CORBA IDL, xcb for X11 - we will probably never have the One True Protocol Description Language.)

> I'm bigger fan of NPL (sorry Olivier), nmparsers project has got large collection of dissectors[3] 
> which we could use (LLTD - bug #6071, Windows USB Port packets - bug #6520, netsh - bug #6694)
> but there might exists some legal (patents for grammar/implementation?!) issues.

That would be one concern.  Even having "our own" language, such as wsgd, runs the risk of infringing a patent, but, well, *writing software of just about any sort* runs that risk.  With NPL, however, we're dealing with a large corporation, so there's probably a greater risk that some or all of it is covered by patents.  It might help were Microsoft to explicitly state that there are no patents on NPL-the-language, or that they're granting a royalty-free license for all implementations (perhaps with a "mutual assured destruction" clause, so that were we to patent some feature of Wireshark and sue Microsoft for violating that patent, our license for their patents would terminate), and that the same applies to any patents they hold on their implementation of NPL that would block independent useful implementations.

> With wsgd we could reuse some existing code of plugin.

...and we also have more freedom to extend the language, e.g. to support preferences for a protocol - Paul Long's blog post says

> A common problem: “No silly, we do HTTP traffic on port 8888, not 80 or 8080!”
>  
> While changing port mappings for protocols could be something revealed in the user interface, we haven’t gotten that far in Network Monitor 3.0 yet.  I expect we should address this specific problem on different fronts, i.e. a UI for each protocol, and some way to handle dynamic port allocations.  And there are also some heuristics we can use to identify protocols as well.  But today, there is a fairly simple way to modify the NPL script for protocols on non-standard ports.

I don't know whether, as of 3.4, they support "a UI for each protocol, and some way to handle dynamic port allocations", but we already have the infrastructure for that.

NPL also, for strings, offers 3 encodings - to quote the help manual:

> This data type extracts a specified number of characters from a sequence of bytes. The characters can be UTF-16, UTF-8, or ASCII, depending on the encoding specified.

There's no mention of the Extended Binary-Coded Decimal Interchange Code there, but we have several dissectors using ENC_EBCDIC, so that would be another place where we might want to extend NPL were we to use it.
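
As a rough illustration of what such an extension would involve, here is a minimal Python sketch of a string extractor that supports the three encodings NPL documents plus EBCDIC.  The function name, the encoding keywords, and the choice of code page 037 as "the" EBCDIC code page are all assumptions made for this example (EBCDIC has many code pages); it also takes a length in bytes, not characters, to keep it short.

```python
# Hypothetical mapping from description-language encoding keywords
# to Python codec names.  "cp037" is one EBCDIC code page among many;
# picking it here is an assumption for illustration only.
CODECS = {
    "ascii": "ascii",
    "utf8": "utf-8",
    "utf16le": "utf-16-le",
    "ebcdic": "cp037",
}


def extract_string(data, offset, nbytes, encoding):
    """Decode nbytes of data starting at offset using the named encoding."""
    raw = data[offset:offset + nbytes]
    if len(raw) < nbytes:
        raise ValueError("packet too short for string field")
    return raw.decode(CODECS[encoding])


# The same five letters in two encodings.
ascii_hello = extract_string(b"HELLO", 0, 5, "ascii")
ebcdic_hello = extract_string(b"\xc8\xc5\xd3\xd3\xd6", 0, 5, "ebcdic")
```

The point is only that the encoding is a table lookup, so adding one is cheap in our own implementation; the hard part is agreeing on the keyword in the language.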

Were there an "Open NPL Consortium" of some sort where multiple implementers of NPL could propose extensions, and perhaps a way an implementation could offer private extensions without worrying about colliding with other implementations or future standards, that might help.

Note, by the way, that having a language of this sort could allow something such as this.

Consider a protocol with the following description (in a C-like protocol description language that I'm making up on the fly):

	enum message_type {
		Login = 0,
		Logout = 1,
		Request = 2,
		Response = 3
	};

	struct login {
		ascii string username[16];
		ascii string password[16];
	};

	struct request {
		uint32 bigendian requested_item;
	};

	struct response {
		uint32 bigendian value_size;
		uint8 value[value_size];
	};

	protocol foo {
		uint32 bigendian enum message_type type;
		switch (type) {

		case Login:
			struct login login;

		case Logout:
			/* logout message has only a type */

		case Request:
			struct request request;

		case Response:
			struct response response;
		}
		uint32 bigendian message_id;
	};

which might translate to (in a pseudo-machine language I'm also making up on the fly):

	uint32 bigendian foo.type saveas x
	switch x:
		0	Login
		1	Logout
		2	Request
		3	Response
	Login:
		ascii string 16 foo.login.username
		ascii string 16 foo.login.password
		goto end
	Logout:
		goto end
	Request:
		uint32 bigendian foo.request.requested_item
		goto end
	Response:
		uint32 bigendian foo.response.value_size saveas y
		uint8 array y foo.response.value
		goto end
	end:
		uint32 bigendian foo.message_id
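
To make the pseudo-machine language above a bit more concrete, here is a minimal sketch of an interpreter for it in Python.  The tuple encoding of the instructions, the opcode names ("uint32be", "ascii", "bytes") and the run() function are all hypothetical, made up for this example; it assumes strings are NUL-padded, which the description above doesn't say, and it does no bounds checking.

```python
import struct

# Hypothetical tuple encoding of the "machine code" for protocol foo.
PROGRAM = [
    ("uint32be", "foo.type", "x"),           # read field, save in register x
    ("switch", "x", {0: "Login", 1: "Logout", 2: "Request", 3: "Response"}),
    ("label", "Login"),
    ("ascii", 16, "foo.login.username"),
    ("ascii", 16, "foo.login.password"),
    ("goto", "end"),
    ("label", "Logout"),
    ("goto", "end"),
    ("label", "Request"),
    ("uint32be", "foo.request.requested_item", None),
    ("goto", "end"),
    ("label", "Response"),
    ("uint32be", "foo.response.value_size", "y"),
    ("bytes", "y", "foo.response.value"),    # length comes from register y
    ("goto", "end"),
    ("label", "end"),
    ("uint32be", "foo.message_id", None),
]


def run(program, data):
    """Interpret program over data and return a dict of field values."""
    labels = {ins[1]: i for i, ins in enumerate(program) if ins[0] == "label"}
    fields, regs = {}, {}
    pc = offset = 0
    while pc < len(program):
        ins = program[pc]
        op = ins[0]
        pc += 1
        if op == "uint32be":
            _, name, reg = ins
            (value,) = struct.unpack_from(">I", data, offset)
            offset += 4
            fields[name] = value
            if reg is not None:
                regs[reg] = value
        elif op == "ascii":
            _, length, name = ins
            raw = data[offset:offset + length]
            # Assumption: fixed-width strings are padded with NULs.
            fields[name] = raw.rstrip(b"\0").decode("ascii")
            offset += length
        elif op == "bytes":
            _, reg, name = ins
            length = regs[reg]
            fields[name] = data[offset:offset + length]
            offset += length
        elif op == "switch":
            _, reg, table = ins
            pc = labels[table[regs[reg]]]
        elif op == "goto":
            pc = labels[ins[1]]
        # "label" is a no-op at run time.
    return fields


# A Response message carrying a 2-byte value, with message ID 0x4073.
packet = struct.pack(">II", 3, 2) + b"\xaa\xbb" + struct.pack(">I", 0x4073)
result = run(PROGRAM, packet)
```

A real implementation would add items to the protocol tree rather than fill in a dict, and would have to cope with truncated packets and unknown message types.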

Now consider a dissection pass being done for a display filter "foo.message_id == 0x4073".  That full "compiled" program is overkill; that dissection pass might optimize it into

	uint32 bigendian foo.type saveas x
	switch x:
		0	Login
		1	Logout
		2	Request
		3	Response
	Login:
		skipbytes 32
		goto end
	Logout:
		goto end
	Request:
		skipbytes 4
		goto end
	Response:
		uint32 bigendian foo.response.value_size saveas y
		skipbytes y
		goto end
	end:
		uint32 bigendian foo.message_id

and, for that dissection pass, run that optimized version of the dissection "machine code" for the foo protocol, and similarly optimized versions of the dissection code for other protocols.  The optimized versions of the dissection "machine code" might be generated as needed (rather than generating optimized versions for every protocol, just generate them from the base code the first time we try to run the code) and cached with the cache key being the set of fields in which the dissection in question was interested (whether because they're being used in a filter or for a column or in "-e {field}" in TShark or...).
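
Here is a minimal Python sketch of the optimize-and-cache idea, under the assumption that the "machine code" is represented as a list of tuples.  The tuple layout, the opcode names, and the optimize() and optimized_for() functions are hypothetical, made up for this example.  A field read is turned into a skip unless the field is wanted or its value is saved in a register (and so is needed to get at later fields), and adjacent fixed-size skips are merged.

```python
# Hypothetical tuple encoding of the base "machine code" for protocol foo.
BASE = [
    ("uint32be", "foo.type", "x"),           # read field, save in register x
    ("switch", "x", {0: "Login", 1: "Logout", 2: "Request", 3: "Response"}),
    ("label", "Login"),
    ("ascii", 16, "foo.login.username"),
    ("ascii", 16, "foo.login.password"),
    ("goto", "end"),
    ("label", "Logout"),
    ("goto", "end"),
    ("label", "Request"),
    ("uint32be", "foo.request.requested_item", None),
    ("goto", "end"),
    ("label", "Response"),
    ("uint32be", "foo.response.value_size", "y"),
    ("bytes", "y", "foo.response.value"),    # length comes from register y
    ("goto", "end"),
    ("label", "end"),
    ("uint32be", "foo.message_id", None),
]


def optimize(program, wanted):
    """Return a copy of program that only extracts the wanted fields."""
    out = []
    for ins in program:
        op = ins[0]
        if op == "uint32be" and ins[1] not in wanted and ins[2] is None:
            ins = ("skipbytes", 4)
        elif op == "ascii" and ins[2] not in wanted:
            ins = ("skipbytes", ins[1])
        elif op == "bytes" and ins[2] not in wanted:
            ins = ("skipbytes", ins[1])      # skip length is a register name
        # Merge adjacent fixed-size skips; labels are instructions too,
        # so two adjacent skips never have a jump target between them.
        if (ins[0] == "skipbytes" and isinstance(ins[1], int) and out
                and out[-1][0] == "skipbytes" and isinstance(out[-1][1], int)):
            out[-1] = ("skipbytes", out[-1][1] + ins[1])
        else:
            out.append(ins)
    return out


_cache = {}


def optimized_for(protocol, program, wanted):
    """Generate the optimized code on first use; key on the set of fields."""
    key = (protocol, frozenset(wanted))
    if key not in _cache:
        _cache[key] = optimize(program, wanted)
    return _cache[key]


# The pass for the display filter "foo.message_id == 0x4073".
opt = optimized_for("foo", BASE, {"foo.message_id"})
```

For that filter, the result has the same shape as the optimized listing above: a 32-byte skip for Login, a 4-byte skip for Request, and a skip of "y" bytes for Response, with foo.type and foo.response.value_size still read because later instructions depend on them.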

This would allow us to get some of the effect of

	if (tree) {
		...
	}

without leaving it up to humans to get it right (which humans often don't), and allow us to do more such optimization as well (as it's not just "do I need a protocol tree?", it's "do I need anything other than these few fields and whatever fields are necessary to get at those fields").

(It also raises the question of whether interpreted execution of that "machine code" or translation to C or machine language will be faster - interpreted execution *could* result in a smaller cache footprint if the interpreter is small enough and the code "high-level" enough to be fairly dense, although it does involve difficult-at-best-to-predict branches in the interpretive loop.)

Of course, this would allow people to extend Wireshark without needing any C developer tools, and would reduce the need for stability in the dissector core code.  Translating to a "machine code" of the sort shown above might also significantly reduce compile time (maybe with support for the CORBA IDL, building Parlay support won't dim the lights :-)), and if those are all loaded at startup time, it might make it easier to build configurations of Wireshark that don't have Every Single Protocol Known To Man and that thus start up more quickly.

On the other hand, it might also allow protocol descriptions to be shipped either in source form or binary form with restrictions on redistribution, providing a way to "get around the GPL" for protocols.  Some might consider that a feature (I seem to remember many years ago Cisco raised this issue about some protocols) and others might consider it a bug.  If we end up with a consensus of "it's a bug", we might be able to extend the protections of the GPL to dissector descriptions fed to the interpreter, so that if you make a "compiled" protocol description available, you must also make the source available to recipients and must give recipients the right to redistribute the source or binaries.