Wireshark-users: Re: [Wireshark-users] filter for ONLY initial get request
From: "Thierry Emmanuel" <Emmanuel.Thierry@xxxxxxxxxxxxxxx>
Date: Thu, 12 Aug 2010 10:09:52 +0200

-----Original Message-----
From: Jeffs [mailto:jeffs@xxxxxxxxxxxxx] 
Sent: Wednesday, August 11, 2010 22:08
To: Thierry Emmanuel
Cc: Community support list for Wireshark
Subject: Re: [Wireshark-users] filter for ONLY initial get request


> I agree that not all websites have their main address as "www".  But 
> given that I am up until now unable to effectively remove all the extra 
> domains that are captured and I am therefore bringing in a lot of 
> extraneous domain names, I have to choose between the lesser of two 
> evils -- lose some domains or pull in a lot of unwanted domain names 
> that totally pollute my desired results.

It's a choice that I can understand. You are the only one who knows exactly
what your needs and constraints are.

> I wish there was a way to capture ONLY the initially requested URL that 
> is either clicked or typed into the browser address bar.

So you might have to write a browser plugin. :D

> I was thinking that maybe a tap might solve this problem because it 
> would capture only one half of a duplex conversation on one wire (the 
> outgoing request) and thus only capture the requested URL.

I'm not sure I clearly understand what you mean, but the advertising links
and meta content which are part of a page and which pollute your results
don't come from the HTTP response. They come from subsequent requests made
by the browser to get the additional content, so capturing only the outgoing
direction would not remove them.

> Your suggestion of parsing the data is indeed unique and interesting.  
> Are you suggesting that dumpcap or ethereal would somehow interrogate the 
> link, follow it and then make a determination?  This sounds like a very 
> interesting prospect but I'm not fully sure I understand how it would work.

It was a somewhat "violent" solution. ;) This information could be extracted
by a MIME-type analyzer processing the content of the page.
But I had forgotten a simpler solution. You can read the MIME type
announced by the web server in the "Content-Type" header. According to
RFC 2616, this header isn't required in the response, but in practice it is
almost always there. So when this header is present in the HTTP response,
you can parse it and check that the response is HTML, plain text, or XML.
If it isn't, you can discard it. This way you'll be able to ignore images,
videos, applets, JavaScript files, and CSS files (less important because
they are commonly hosted on the same domain), which removes a large part of
the noise.
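
As a rough illustration of that check, here is a minimal Python sketch. It
assumes the Content-Type value has already been pulled out of the capture as
a string; the function name, the exact list of accepted types, and the choice
to keep responses that carry no header at all are assumptions of mine, not
anything Wireshark provides.

```python
# Media types counted as "real pages" (an assumption; adjust as needed).
PAGE_TYPES = {"text/html", "text/plain", "text/xml", "application/xml"}


def is_page_content(content_type):
    """Return True if a Content-Type value looks like HTML, plain text or XML.

    content_type is the raw header value, e.g. "text/html; charset=UTF-8",
    or None when the response carried no Content-Type header.
    """
    if content_type is None:
        # Assumption: without the header we cannot decide, so keep it.
        return True
    # Drop parameters such as "; charset=..." and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in PAGE_TYPES
```

With this, "text/html; charset=UTF-8" is kept while "image/png" or
"application/javascript" is discarded.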

The difficulty is that if you extract the requested URL from the request, you
have to correlate the request with its response, and you might need
scripting. You could bypass this limitation by working only on the
response, but I haven't studied that point in particular.
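
To give an idea of what such scripting could look like, here is a small
Python sketch of the request/response matching. It relies on the fact that,
on a single HTTP/1.1 connection, responses come back in the same order as
the requests. The input format (a list of already decoded events with a
connection identifier) is purely hypothetical, and the sketch ignores
interim responses such as "100 Continue", so take it as a starting point
rather than a finished tool.

```python
from collections import defaultdict, deque

# Media types counted as "real pages" (an assumption; adjust as needed).
PAGE_TYPES = {"text/html", "text/plain", "text/xml", "application/xml"}


def page_urls(events):
    """Return the requested URLs whose response looks like a page.

    events is an iterable of (connection_id, kind, value) tuples in capture
    order (a hypothetical format), where kind is "request" (value = URL)
    or "response" (value = Content-Type string, or None if absent).
    """
    pending = defaultdict(deque)  # per-connection FIFO of unanswered URLs
    urls = []
    for conn, kind, value in events:
        if kind == "request":
            pending[conn].append(value)
        elif kind == "response" and pending[conn]:
            # The oldest unanswered request on this connection.
            url = pending[conn].popleft()
            if value is None:
                # Assumption: no Content-Type header, so keep the URL.
                urls.append(url)
                continue
            media_type = value.split(";", 1)[0].strip().lower()
            if media_type in PAGE_TYPES:
                urls.append(url)
    return urls
```

For example, a page request followed by an image request on the same
connection would yield only the page URL once both responses are seen.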


Best regards