Wireshark-users: Re: [Wireshark-users] filter for ONLY initial get request
From: "Thierry Emmanuel" <Emmanuel.Thierry@xxxxxxxxxxxxxxx>
Date: Wed, 11 Aug 2010 15:35:18 +0200

-----Original Message-----
From: wireshark-users-bounces@xxxxxxxxxxxxx [mailto:wireshark-users-bounces@xxxxxxxxxxxxx] On Behalf Of Jeffs
Sent: Wednesday, 11 August 2010 15:07
To: Community support list for Wireshark
Subject: Re: [Wireshark-users] filter for ONLY initial get request

>
> This formula, however, only returns results minus the links and images 
> embedded in the web page:
> 
> tshark -r test.cap -T fields -e http.host  | sed 's/?.*$//' | sed -n 
> '/www./p'  | sort | uniq -c | sort -rn | head -n 100
> 
>      15 www.propertyshark.com
>       8 www.nytimes.com
>       2 www.google-analytics.com
>       1 www.facebook.com
> 
> 
> However, I am new to regex so I'm sure I may be missing  something or 
> losing some links.
>


It is a common mistake to assume that every website has its main
address on a "www" subdomain. If you want a generic filter, you cannot
rely on that. To get relevant results, you will have to build a less
restrictive regexp and filter out inappropriate results by hand,
possibly adding rules to exclude well-known advertising sites.
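
As a rough sketch of that idea, the pipeline could drop the "www" test
and subtract a short exclusion list instead. The function name
count_hosts and the two excluded domains are only illustrative
assumptions, not a vetted list of advertising sites:

```shell
# Count hosts read from stdin (one per line), most frequent first.
# Empty lines (packets without an HTTP Host header) are dropped, and so
# are hosts on a small, purely illustrative exclusion list.
count_hosts() {
    grep -v '^$' |
    grep -v -E '(^|\.)(google-analytics\.com|doubleclick\.net)$' |
    sort | uniq -c | sort -rn | head -n 100
}

# Hypothetical usage with the capture file from the original command:
#   tshark -r test.cap -T fields -e http.host | count_hosts
```

This keeps sites such as "example.org" that have no "www" prefix, at the
cost of having to maintain the exclusion list by hand.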

A fully automatic solution would be to parse the data and check that it
is a well-formed HTML (or XML or plain-text) document. This would purge
videos and images from your results.
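
A cheaper stand-in for real parsing, if you are willing to trust the
server, is to look at the declared Content-Type of each response rather
than the body itself. The sketch below only classifies a content-type
string; the helper name is_document_type and the list of accepted types
are assumptions, and matching responses back to the requested host is
left out:

```shell
# Return success if the given Content-Type value looks like a document
# (HTML, XML or plain text) rather than an image, video or other media.
# This checks the declared type only; it does not parse the body.
is_document_type() {
    # Content types are case-insensitive, so normalise first.
    ct=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    case "$ct" in
        text/html*|text/plain*|text/xml*|application/xml*|application/xhtml+xml*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Hypothetical usage, assuming the http.content_type field is populated:
#   tshark -r test.cap -T fields -e http.content_type |
#   while IFS= read -r ct; do
#       is_document_type "$ct" && printf '%s\n' "$ct"
#   done
```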