[Ethereal-dev] RFC: new TVBUFF type

From: "Ronnie Sahlberg" <rsahlber@xxxxxxxxxxxxxx>
Date: Tue, 22 May 2001 18:10:50 +1000
Hi list,
Please give me comments on this idea.

I have been thinking a lot about how the IP reassembly is implemented, and I
really don't like
that it requires a duplicate backing store in malloc()ed memory.
Then I realized that what IP defragmentation needs is basically the same thing
that some other
interesting features would need. The features I have in mind are:
1: treating a TCP data stream as an "array of bytes".
2: merging parts of various UDP/TCP payloads into larger TVBUFFs which one
can dissect/do other interesting stuff with.
Once the design is settled and implemented, I plan to use IP
defragmentation as a test/proof of concept.


The design I have come up with, which should be reasonably generic and
(hopefully) useful, is:

TVBUFF_SPARSE_ARRAY
Create a new type of TVBUFF, let's call it TVBUFF_SPARSE_ARRAY.
This TVBUFF type can be thought of as similar to COMPOSITE, except that one can
insert a new TVBUFF segment anywhere,
not only prepend and append. Also, the data which this TVBUFF describes can
have "holes" in it.
This type of TVBUFF can be created and destroyed like any other TVBUFF, though
none of the accessors or any
other functions in tvbuff.c can operate on this TVBUFF.
* tvb_sparse_array_new()
* tvb_sparse_array_free()
This TVBUFF type is only a placeholder to keep other tvbuffs in until a
TVBUFF_COMPOSITE can be extracted from it.
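Just to make the idea concrete, here is a minimal sketch of what the type and
its constructor/destructor could look like. The struct layout and all field
names are only illustrative, and the GSList is just a stand-in for the
tree+linked-list structure I describe further down:

#include <glib.h>
#include "tvbuff.h"    /* for tvbuff_t; adjust the path as needed */

typedef struct _sparse_fragment {
    guint32   start;   /* offset in the virtual 4GB address space    */
    guint32   len;     /* number of bytes this fragment covers       */
    tvbuff_t *tvb;     /* REAL_DATA or SUBSET tvb holding the bytes  */
    guint32   offset;  /* offset of the data within that tvb         */
} sparse_fragment_t;

typedef struct _tvb_sparse_array {
    GSList *fragments; /* { start, len, tvb, offset } entries, kept
                        * sorted by start */
} tvb_sparse_array_t;

tvb_sparse_array_t *
tvb_sparse_array_new(void)
{
    tvb_sparse_array_t *sa;

    sa = g_malloc(sizeof(tvb_sparse_array_t));
    sa->fragments = NULL;
    return sa;
}

void
tvb_sparse_array_free(tvb_sparse_array_t *sa)
{
    GSList *l;

    /* free only the mapping; the underlying tvbs belong to their
     * packets and are not touched here */
    for (l = sa->fragments; l != NULL; l = l->next)
        g_free(l->data);
    g_slist_free(sa->fragments);
    g_free(sa);
}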

Then there will be two functions which operate only on TVBUFF_SPARSE_ARRAY
* int tvb_sparse_array_insert(tvb_sparse_array, start, len, tvb, offset)
       tvb is any ordinary tvb type, not SPARSE_ARRAY.
       This creates (if necessary) a TVB_SUBSET of tvb covering the data from
offset to offset+len
       and links it into tvb_sparse_array from start to start+len.
       Linking will be similar to how COMPOSITE works today, i.e. no
copying/duplication of data.
       If the entire tvb_sparse_array[start,start+len] is already mapped (i.e.
no holes in this interval) the insertion will fail and return 0.
       If there is a partial overlap between tvb[offset,offset+len] and
tvb_sparse_array[start,start+len], one or more SUBSETs will be created
       from tvb[offset,offset+len] to fit the holes in
tvb_sparse_array[start,start+len].

       To keep track of all REAL_DATA and SUBSET tvbs we have mapped into
the SPARSE_ARRAY, we keep a "list" of
       { start, len, tvb, offset } structures in each SPARSE_ARRAY. How this
list is sorted and how it is managed, I will describe below.

       I think one can limit start, len and offset to guint32, i.e.
allow a virtual array of 4GB for each SPARSE_ARRAY
       (not really good, but it will do for now). (E.g. this allows direct
mapping of a packet's TCP sequence number to start.)

      One can see this function as populating the "virtual" 4GB array space
of the sparse_array tvb.
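As a very rough sketch of the insertion, here is a version that handles only
the two easy cases: the interval is completely unmapped, or it is already
covered by an existing fragment. Partial overlaps, which would create one
SUBSET per remaining hole, are left out. tvb_new_subset() is the existing
tvbuff.c call for creating a SUBSET; everything else is made up for this
example, and the GSList again stands in for the real structure:

static gint
fragment_compare(gconstpointer a, gconstpointer b)
{
    const sparse_fragment_t *fa = a;
    const sparse_fragment_t *fb = b;

    if (fa->start < fb->start)
        return -1;
    if (fa->start > fb->start)
        return 1;
    return 0;
}

int
tvb_sparse_array_insert(tvb_sparse_array_t *sa, guint32 start, guint32 len,
                        tvbuff_t *tvb, guint32 offset)
{
    GSList            *l;
    sparse_fragment_t *frag, *newfrag;

    for (l = sa->fragments; l != NULL; l = l->next) {
        frag = (sparse_fragment_t *)l->data;
        if (frag->start >= start + len || frag->start + frag->len <= start)
            continue;       /* no overlap with this fragment            */
        if (frag->start <= start && frag->start + frag->len >= start + len)
            return 0;       /* the whole interval is already mapped     */
        /* partial overlap: real code would create one SUBSET per
         * remaining hole here; omitted to keep the sketch short        */
        return 0;
    }

    /* no overlap at all: one SUBSET covers the whole interval */
    newfrag = g_malloc(sizeof(sparse_fragment_t));
    newfrag->start  = start;
    newfrag->len    = len;
    newfrag->tvb    = tvb_new_subset(tvb, offset, len, len);
    newfrag->offset = 0;
    sa->fragments   = g_slist_insert_sorted(sa->fragments, newfrag,
                                            fragment_compare);
    return 1;
}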

* tvbuff_t *tvb_sparse_array_extract_tvb(tvb_sparse_array, start, len,
flags)
      This function will scan the tvb_sparse_array from start to start+len
and, if there are no holes in this interval
      (i.e. we have all this data mapped from other tvbs), it will create a
new COMPOSITE tvb and return it.
      If any data is not mapped (a hole) in
tvb_sparse_array[start,start+len], the call will fail and return NULL.

      The flags are for special features, such as Follow-TCP-Stream.
      0x0001  Scan back until the start of the current mapped patch
             If this flag is set, the mapping will not start from
tvb_sparse_array[start] but from as far before start as possible
             without passing a hole.
      0x0002  Scan forward until the end of the current mapped patch
             With this flag set, the returned tvb will not only map up to
tvb_sparse_array[start+len] but as far beyond start+len as
             possible without crossing a hole.
      flags==0 will probably be the normal usage, but flags==0x0003 would be
useful for Follow-TCP-Stream, where one packet
      in the middle of a conversation is selected and one wants to process
as much as possible of the conversation.

      One must extract a tvb using this function before one can access any
data in the sparse_array.
      One cannot extract data or use any accessor on the sparse_array
itself.
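To show how a dissector could use this for IP reassembly, here is a
hypothetical helper. tvb_length() is the existing accessor; the flag names are
invented here for readability and only the bit values come from the list
above:

#define TVB_SPARSE_SCAN_BACK     0x0001   /* proposed flag 0x0001 */
#define TVB_SPARSE_SCAN_FORWARD  0x0002   /* proposed flag 0x0002 */

/* Map one IP fragment into the conversation's SPARSE_ARRAY at its
 * fragment offset; if the datagram is now complete, return a
 * COMPOSITE spanning all of it, otherwise NULL. */
static tvbuff_t *
ip_defragment(tvb_sparse_array_t *sa, tvbuff_t *frag_tvb,
              guint32 frag_offset, guint32 total_len)
{
    tvb_sparse_array_insert(sa, frag_offset, tvb_length(frag_tvb),
                            frag_tvb, 0);

    /* flags==0: we want exactly [0, total_len), nothing more */
    return tvb_sparse_array_extract_tvb(sa, 0, total_len, 0);
}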




How to structure/sort the patches in the sparse_array:
There can potentially be a very large number of REAL_DATA or SUBSET tvbs making
up a large SPARSE_ARRAY. I think of
NFS over TCP and similar things, i.e. I can easily imagine a sparse_array
consisting of tens of thousands of SUBSETs.

Hash tables will probably not be very useful. What is there to hash? If one
considers start==tcp.sequencenumber or start==ip.fragmentoffset,
one sees that for these applications the range of start values will be
local in the address domain (e.g. 10MB contiguous out of a 4GB address space).
Also, when extracting data from start to start+len there is no guarantee
that start will match the start value of the underlying SUBSET.
One needs to be able to search for which mapped SUBSET start falls
into.
I.e. if we search for start==27, we want to find the SUBSET that maps 20 to
29 (and extract starting from 27).
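In other words, the lookup has to be an interval lookup. The comparison it
needs could look something like this sketch (names made up, reusing the
fragment structure from the earlier sketch):

/* Given a virtual offset "pos", decide whether it falls inside a
 * fragment or on which side the search should continue.  With this
 * comparison, a search for pos==27 finds the fragment mapping 20-29. */
static gint
fragment_find(guint32 pos, const sparse_fragment_t *frag)
{
    if (pos < frag->start)
        return -1;                  /* keep looking to the left  */
    if (pos >= frag->start + frag->len)
        return 1;                   /* keep looking to the right */
    return 0;                       /* pos falls inside frag     */
}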

Sequential searches through linked lists will be too slow if the list is
big. But doubly linked lists are fast to scan forward and backward in.
In a doubly linked list, it is very inexpensive to add a new head or tail.

Balanced binary trees are very fast to do a random search in (O(log n)) but more
complicated to do forward/backward scans in (that's what macros
are for). Another drawback is that self-balancing trees are very expensive
to delete nodes from.
It is reasonably easy and cheap to keep a tree balanced after adding new
nodes to it, though.

One can also expect that the most common case will (almost always) be adding new
nodes either at the far left of the tree (start values decreasing as we add
new data from new packets, i.e. ip.fragmentoffset decreasing packet by
packet, as when Linux sends fragments in the "non-obvious" order) or at the far
right (ever
increasing TCP sequence numbers). I think that adding new nodes in the middle
of the tree will be a special/unusual case.


To manage this data structure efficiently in the common cases, I think the
best solution is to implement a special version of
a self-balancing tree which is optimized for adding leaf nodes at the far
right or far left of the tree. Also, this tree need not support deleting
individual nodes;
either we add nodes or we delete the entire tree.
It is very easy to create a very cheap algorithm for adding nodes when the
common case is adding nodes at either the extreme right or the extreme left
(this common case also easily leads to almost perfectly balanced trees).

But a sequential scan forward/backward can potentially be an expensive
operation in a tree (even in a balanced tree of 100,000 elements one might need
to traverse up to 30-35 nodes just to get from one element to the
sequentially next one). Sequential scans forward/backward will be used when
creating the
extracted COMPOSITE, scanning from start to start+len. Linked lists are much
more efficient in this case.
So combine the tree with a linked list. When adding new nodes, add the node
to the linked list as well (always cheap, since when linking it into the tree we
always know either the element immediately before or immediately after the new
one in the list).
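A minimal sketch of what a node in such a combined structure could look like,
and of the cheap insert path for the two common cases. All names are
illustrative; rebalancing and the middle-of-the-tree insertion are left out:

typedef struct _sparse_node {
    sparse_fragment_t    frag;   /* the { start, len, tvb, offset } entry */

    /* binary tree keyed on frag.start: O(log n) random lookups */
    struct _sparse_node *left;
    struct _sparse_node *right;

    /* doubly linked list in start order: cheap sequential scans
     * when assembling the extracted COMPOSITE */
    struct _sparse_node *prev;
    struct _sparse_node *next;
} sparse_node_t;

typedef struct _sparse_tree {
    sparse_node_t *root;
    sparse_node_t *leftmost;    /* cached extremes make the common  */
    sparse_node_t *rightmost;   /* cases (far left/far right) O(1)  */
} sparse_tree_t;

static void
sparse_tree_add(sparse_tree_t *t, sparse_node_t *n)
{
    n->left = n->right = n->prev = n->next = NULL;

    if (t->root == NULL) {
        t->root = t->leftmost = t->rightmost = n;
    } else if (n->frag.start >= t->rightmost->frag.start) {
        /* common case: new extreme right (increasing TCP seq)      */
        t->rightmost->right = n;
        t->rightmost->next  = n;    /* list neighbour is known      */
        n->prev             = t->rightmost;
        t->rightmost        = n;
    } else if (n->frag.start < t->leftmost->frag.start) {
        /* common case: new extreme left (decreasing frag offset)   */
        t->leftmost->left = n;
        t->leftmost->prev = n;      /* list neighbour is known      */
        n->next           = t->leftmost;
        t->leftmost       = n;
    } else {
        /* unusual case: descend from the root; the leaf we attach
         * to is the new node's in-order neighbour, so the list
         * links still come for free.  Omitted here, as is the
         * occasional rebalancing pass. */
    }
}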


I have always thought the combination binary tree + linked list is the best
data structure there is. Am I crazy?
Well, I guess a specialized manager for the binary-tree + linked-list
structure can be implemented in a few hundred lines of code.


Applications for SPARSE_ARRAY would be:
* IP reassembly
* TCP streams as arrays of bytes: extract a COMPOSITE containing all data
within an arbitrary subset of any TCP conversation (if all the data exists).
* RPC over TCP: create and dissect the payloads of large upper-layer protocols,
e.g. NFS Read/Write over TCP.
* NFS files: merge the COMPOSITEs from NFS Reads/Writes into a SPARSE_ARRAY
and extract a COMPOSITE which spans the entire
   file.
* Allow TCP to link all packets with len>0 into SPARSE_ARRAYs, one for each
conversation (and direction). Follow-TCP-Stream would
   then be extremely easy to reimplement; it would not need
to filter any data, just ask for a COMPOSITE as big as possible
   (flags==0x0003) from the SPARSE_ARRAY which maps the conversation (see the
sketch below this list).
* Reconstructing files over FTP would be trivial.
* Reconstructing files/transactions over HTTP would be trivial.
etc etc
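For the Follow-TCP-Stream case in particular, the whole feature could boil
down to something like this (hypothetical, reusing the made-up flag names from
the defragmentation example above):

/* Starting from any byte the selected packet maps into the stream's
 * SPARSE_ARRAY, grab as much of the stream as is contiguously mapped
 * in both directions. */
static tvbuff_t *
follow_tcp_stream(tvb_sparse_array_t *sa, guint32 selected_seq)
{
    return tvb_sparse_array_extract_tvb(sa, selected_seq, 1,
            TVB_SPARSE_SCAN_BACK | TVB_SPARSE_SCAN_FORWARD);
}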