Wireshark-dev: Re: [Wireshark-dev] Wireshark Git Mirror Maintenance
From: Evan Huus <eapache@xxxxxxxxx>
Date: Sun, 3 Aug 2014 18:34:11 -0400
On Sun, Aug 3, 2014 at 6:20 PM, Gerald Combs <gerald@xxxxxxxxxxxxx> wrote:
On 8/3/14, 11:34 AM, Evan Huus wrote:
> On Mon, May 13, 2013 at 7:54 PM, Gerald Combs <gerald@xxxxxxxxxxxxx
> <mailto:gerald@xxxxxxxxxxxxx>> wrote:
>
>     On 5/10/13 1:47 PM, Evan Huus wrote:
>     > Hi Gerald
>     >
>     > I just cloned the Wireshark git mirror onto a new machine and was
>     > surprised at how large it was to download. Running an aggressive git
>     > gc on the finished clone reduced the disk usage on my machine from
>     > ~500MB to ~150MB.
>     >
>     > I'm a bit surprised - git is supposed to automatically garbage collect
>     > repositories when they get too cluttered, but perhaps its threshold
>     > for automatic gc is just very high.
>     >
>     > I pinged Balint (CCed) about this and he suggested running gc on a
>     > weekly basis and gc --aggressive on a monthly basis on the server. It
>     > would probably save a non-trivial amount of bandwidth in the long term
>     > as more people clone the repository.
>
>     It might be due to our particular circumstances (a bare repository only
>     updated via the mirror script) but git's automatic garbage collection
>     doesn't seem to happen very often. The mirror script runs "git gc
>     --auto" each time it synchronizes which keeps it from filling up the
>     disk (which happened early on) but as you point out there is room for
>     improvement. I added a cron job that runs "git gc --aggressive" each
>     week. Here is the output from a manual run, which includes "git
>     count-objects -v" before and after:
>
>     2013-05-13 14:38:12: Started.
>     2013-05-13 14:38:12: Synchronizing repository wireshark
>     2013-05-13 14:38:12: Object count start
>     count: 0
>     size: 0
>     in-pack: 316591
>     packs: 45
>     size-pack: 567146
>     prune-packable: 0
>     garbage: 0
>     2013-05-13 14:38:12: Collecting garbage
>     2013-05-13 15:09:56: Object count start
>     count: 0
>     size: 0
>     in-pack: 316596
>     packs: 2
>     size-pack: 127499
>     prune-packable: 0
>     garbage: 0
>     2013-05-13 15:09:56: Done
>
>
> So it's been over a year since this conversation and we have actually
> migrated to Git/Gerrit so I have no idea what Gerrit is doing in this
> regard (is there even a "real" git repository backing it, or is it all
> internal magic?), but I recently came across [1] which suggests that
> repeated use of --aggressive maybe wasn't such a good idea after all.
>
> It suggests just sticking to regular `git gc` except in cases of large
> one-time imports (like we did on migration) at which point you should
> run the apparently-very-slow `git repack -a -d --depth=250 --window=250`.
>
> FWIW, a fresh clone from Gerrit right now is 213MB - my local repo is
> only 161MB, and my current desktop is actually not beefy enough to run
> the recommended repack command so I have no idea what improvement that
> would give.

It's a "real" git repository but any operations performed by Gerrit are
done using JGit. The weekly automatic number update script runs `gerrit
gc --all`, which uses JGit's garbage collector. Many sites including
Google appear to run it one or more times a day. We may want to do the same.

I tried running `git repack -a -d --depth=250 --window=250` on the
server. It ran successfully and shrank the repository from 248 MB to 208
MB but now the OS X builders are timing out during `git fetch`...

Hmm, that's interesting, I would have expected a bigger improvement (given my local copy is still smaller than the one on the server). Perhaps it is worth trying an --aggressive gc just once (or passing the -f and -F flags to the existing repack command, which is probably even *more* aggressive).
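
For comparing runs, here is a rough sketch of how the before/after numbers could be captured around a forced repack. It assumes a git new enough to support `-C`, and the function names (`size_pack`, `percent_saved`, `repack_and_report`) are made up for illustration. It only adds `-f` to the existing command, not `-F`, and I have not tried it against the server repository.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of git.

# Extract the "size-pack" value (KiB) from `git count-objects -v` output on stdin.
size_pack() {
    awk -F': ' '$1 == "size-pack" { print $2 }'
}

# Print the percentage reduction from $1 (before) to $2 (after), one decimal place.
percent_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

# Measure, repack with -f (recompute deltas instead of reusing them), measure again.
# $1 is the path to the repository.
repack_and_report() {
    before=$(git -C "$1" count-objects -v | size_pack)
    git -C "$1" repack -a -d -f --depth=250 --window=250
    after=$(git -C "$1" count-objects -v | size_pack)
    echo "size-pack: ${before} KiB -> ${after} KiB ($(percent_saved "$before" "$after")% saved)"
}
```

Feeding it the size-pack figures from the 2013 log quoted above (567146 and 127499), `percent_saved` reports 77.5, so that is roughly the scale of saving the first aggressive gc gave us.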

No idea why the buildbots would be timing out... I don't think the gc should have materially affected their ability to pull down deltas.