Wireshark-dev: [Wireshark-dev] Proposal to improve filtration speed by caching fields that are
From: Sidhant Bansal <sidhbansal@xxxxxxxxx>
Date: Mon, 15 Jun 2020 11:38:49 +0800
Hi all,

I want to propose an improvement to speed up display filters: avoid re-dissecting all the packets again and again when it is not required, and instead maintain a cache of the fields that have been queried recently.

Motivation: Benchmarking filtering on capture files > 100 MB shows that a lot of time goes into the re-dissection step, i.e. the time spent inside the dissectors: more than roughly 40-50% of the total filtering time. I believe we can make huge savings here.

Example:
1st filter applied: tcp.srcport >= 1200 && tcp.dstport <= 1500
This filter runs as it does right now AND stores tcp.srcport and tcp.dstport for all the packets in memory in Wireshark.
2nd filter applied: tcp.srcport == 80
We don't need to re-dissect the packets and can simply apply the filter to the stored information.
3rd filter applied: tcp.srcport == 120 || udp.srcport == 80
Since we haven't stored udp.srcport in our cache, we need to re-dissect, AND we also store udp.srcport for all the packets (to speed up future filter queries).
4th filter applied: tcp.srcport == 40 || udp.srcport >= 1000 || tcp.dstport <= 500
Since all of these fields are in the cache, we can read them directly from the information stored in memory and don't need to re-dissect any of the packets.
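
The hit/miss decisions in the example above can be sketched as follows. This is only an illustration in Python, not Wireshark code: the regular expression is a naive stand-in for the real filter parser, and the names fields_in and plan are hypothetical.

```python
import re

# Naive, hypothetical stand-in for the filter parser: matches dotted
# field names such as "tcp.srcport" and ignores numbers and operators.
FIELD_RE = re.compile(r"[a-z][a-z0-9_]*(?:\.[a-z0-9_]+)+")


def fields_in(filter_text):
    """Return the set of field names referenced by a filter string."""
    return set(FIELD_RE.findall(filter_text))


def plan(filters):
    """For each filter, list the fields that are not cached yet.

    An empty list means the filter can be answered from the cache;
    a non-empty list means a re-dissection pass is needed, after which
    the filter's fields are considered cached.
    """
    cached = set()
    steps = []
    for text in filters:
        wanted = fields_in(text)
        missing = wanted - cached
        steps.append((text, sorted(missing)))
        cached |= wanted
    return steps


if __name__ == "__main__":
    example = [
        "tcp.srcport >= 1200 && tcp.dstport <= 1500",
        "tcp.srcport == 80",
        "tcp.srcport == 120 || udp.srcport == 80",
        "tcp.srcport == 40 || udp.srcport >= 1000 || tcp.dstport <= 500",
    ]
    for text, missing in plan(example):
        action = "re-dissect, cache %s" % missing if missing else "use cache"
        print("%-70s %s" % (text, action))
```

With the four filters from the example, only the 1st and 3rd require a dissection pass.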

We can limit the number of fields we store in memory at any given time depending on how many packets we have and how much memory we can afford to allocate. Fields can then be evicted from the cache according to a cache replacement policy (I haven't thought about which one would be the most apt; input is welcome).
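
As one possible starting point for that discussion, here is a minimal sketch of a bounded field cache in Python. It assumes least-recently-used (LRU) eviction, which is only one candidate policy and not a decision; the 8-bytes-per-value default and the names max_cacheable_fields and FieldCache are likewise assumptions made for illustration.

```python
from collections import OrderedDict


def max_cacheable_fields(memory_budget_bytes, packet_count, bytes_per_value=8):
    """How many per-packet field columns fit in the given memory budget.

    Assumes every cached value occupies bytes_per_value bytes and ignores
    any bookkeeping overhead.
    """
    if packet_count == 0:
        return 0
    return memory_budget_bytes // (packet_count * bytes_per_value)


class FieldCache:
    """Holds at most max_fields columns; evicts the least recently used."""

    def __init__(self, max_fields):
        self.max_fields = max_fields
        self._columns = OrderedDict()  # field name -> {frame number: value}

    def get(self, field):
        """Return the cached column for field, or None on a miss."""
        if field not in self._columns:
            return None
        self._columns.move_to_end(field)  # mark as most recently used
        return self._columns[field]

    def put(self, field, values):
        """Store a column, evicting old columns if over the limit."""
        self._columns[field] = values
        self._columns.move_to_end(field)
        while len(self._columns) > self.max_fields:
            self._columns.popitem(last=False)  # drop least recently used
```

Under these assumptions, a 64 MiB budget and 1,000,000 packets leave room for 8 cached fields.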

Most fields tend to be fixed-length and small, i.e. <= 8 bytes. For variable-length fields such as strings, which can be arbitrarily large, we can skip caching and instead re-dissect all the packets whenever the filter expression contains such a field.
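
A rough sketch of that eligibility check, in Python for brevity. The FIELD_SIZES table is a hypothetical stand-in for the field registration information; None marks a variable-length field, and unknown fields are conservatively treated as not cacheable.

```python
MAX_CACHED_VALUE_BYTES = 8

# Hypothetical size table: field name -> fixed size in bytes,
# or None for variable-length fields such as strings.
FIELD_SIZES = {
    "tcp.srcport": 2,
    "tcp.dstport": 2,
    "udp.srcport": 2,
    "http.host": None,
}


def is_cacheable(field):
    """True if the field is fixed-length and small enough to cache."""
    size = FIELD_SIZES.get(field)
    return size is not None and size <= MAX_CACHED_VALUE_BYTES


def filter_can_use_cache(fields):
    """A filter is eligible only if every field it references is cacheable."""
    return all(is_cacheable(name) for name in fields)
```

A filter on tcp.srcport and udp.srcport would be eligible, while any filter that mentions http.host would fall back to re-dissection.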

From an implementation point of view: the cached field information can be stored inside frame_data, since that persists throughout Wireshark's execution for a single open capture file. Whenever we encounter a new filter query, we check whether all of its fields are in the cache. If yes, then once the abstract syntax tree of the filter query has been converted to DFVM and is evaluated, we look the values up in the cache instead of re-dissecting. If no, then we do what we do currently, i.e. re-dissect, but we also store the new fields in our cache (according to the chosen replacement policy).
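
The control flow described above could look roughly like this. It is a simplified model in Python, not the actual epan code: Frame stands in for frame_data, the raw dictionary stands in for the packet bytes plus dissector, the predicate stands in for the compiled DFVM program, and eviction is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Stand-in for frame_data: 'raw' models what a dissector would yield."""
    raw: dict
    cached: dict = field(default_factory=dict)


class Capture:
    def __init__(self, frames):
        self.frames = frames
        self.cached_fields = set()
        self.dissections = 0  # counts how often a frame was (re-)dissected

    def _dissect(self, frame):
        """Hypothetical dissector call: the expensive step we want to skip."""
        self.dissections += 1
        return dict(frame.raw)

    def apply_filter(self, fields, predicate):
        """Return the numbers of the frames that match the predicate."""
        use_cache = set(fields) <= self.cached_fields
        matches = []
        for number, frame in enumerate(self.frames, start=1):
            if use_cache:
                values = frame.cached            # cache hit: no dissection
            else:
                values = self._dissect(frame)    # cache miss: re-dissect
                for name in fields:
                    frame.cached[name] = values.get(name)
            if predicate(values):
                matches.append(number)
        self.cached_fields |= set(fields)
        return matches
```

In this model, a second filter that only touches already cached fields leaves the dissections counter unchanged.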

I would like to hear any feedback or objections to this optimization.