On 8/3/14, 11:34 AM, Evan Huus wrote:
> On Mon, May 13, 2013 at 7:54 PM, Gerald Combs <gerald@xxxxxxxxxxxxx
> <mailto:gerald@xxxxxxxxxxxxx>> wrote:
>
> On 5/10/13 1:47 PM, Evan Huus wrote:
> > Hi Gerald
> >
> > I just cloned the Wireshark git mirror onto a new machine and was
> > surprised at how large it was to download. Running an aggressive git
> > gc on the finished clone reduced the disk usage on my machine from
> > ~500MB to ~150MB.
> >
> > I'm a bit surprised - git is supposed to automatically garbage collect
> > repositories when they get too cluttered, but perhaps its threshold
> > for automatic gc is just very high.
> >
> > I pinged Balint (CCed) about this and he suggested running gc on a
> > weekly basis and gc --aggressive on a monthly basis on the server. It
> > would probably save a non-trivial amount of bandwidth in the long term
> > as more people clone the repository.
>
> It might be due to our particular circumstances (a bare repository only
> updated via the mirror script) but git's automatic garbage collection
> doesn't seem to happen very often. The mirror script runs "git gc
> --auto" each time it synchronizes which keeps it from filling up the
> disk (which happened early on) but as you point out there is room for
> improvement. I added a cron job that runs "git gc --aggressive" each
> week. Here is the output from a manual run, which includes "git
> count-objects -v" before and after:
>
> 2013-05-13 14:38:12: Started.
> 2013-05-13 14:38:12: Synchronizing repository wireshark
> 2013-05-13 14:38:12: Object count start
> count: 0
> size: 0
> in-pack: 316591
> packs: 45
> size-pack: 567146
> prune-packable: 0
> garbage: 0
> 2013-05-13 14:38:12: Collecting garbage
> 2013-05-13 15:09:56: Object count start
> count: 0
> size: 0
> in-pack: 316596
> packs: 2
> size-pack: 127499
> prune-packable: 0
> garbage: 0
> 2013-05-13 15:09:56: Done
>
>
> So it's been over a year since this conversation and we have actually
> migrated to Git/Gerrit so I have no idea what Gerrit is doing in this
> regard (is there even a "real" git repository backing it, or is it all
> internal magic?), but I recently came across [1] which suggests that
> repeated use of --aggressive maybe wasn't such a good idea after all.
>
> It suggests just sticking to regular `git gc` except in cases of large
> one-time imports (like we did on migration) at which point you should
> run the apparently-very-slow `git repack -a -d --depth=250 --window=250`.
>
> FWIW, a fresh clone from Gerrit right now is 213MB - my local repo is
> only 161MB, and my current desktop is actually not beefy enough to run
> the recommended repack command so I have no idea what improvement that
> would give.
It's a "real" git repository, but any operations performed by Gerrit are
done using JGit. The weekly automatic number update script runs `gerrit
gc --all`, which uses JGit's garbage collector. Many sites including
Google appear to run it one or more times a day. We may want to do the same.

I tried running `git repack -a -d --depth=250 --window=250` on the
server. It ran successfully and shrank the repository from 248 MB to 208
MB but now the OS X builders are timing out during `git fetch`...
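
For comparing runs like these, here is a rough sketch of how the `git count-objects -v` output quoted earlier could be turned into a single reduction figure. The helper names (`parse_count_objects`, `pack_reduction`) are made up for illustration, and the sample text is simply the before/after output from the 2013 `git gc --aggressive` run above. As far as I know `size-pack` is reported in KiB, but the ratio does not depend on the unit.

```python
# Hypothetical helpers for comparing two `git count-objects -v` snapshots.
# The sample values are copied from the 2013 log quoted in this thread.

BEFORE = """count: 0
size: 0
in-pack: 316591
packs: 45
size-pack: 567146
prune-packable: 0
garbage: 0
"""

AFTER = """count: 0
size: 0
in-pack: 316596
packs: 2
size-pack: 127499
prune-packable: 0
garbage: 0
"""


def parse_count_objects(text):
    """Parse `key: value` lines into a dict, keeping only integer values."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def pack_reduction(before, after):
    """Return the fractional drop in size-pack between two snapshots."""
    size_before = parse_count_objects(before)["size-pack"]
    size_after = parse_count_objects(after)["size-pack"]
    return 1 - size_after / size_before


if __name__ == "__main__":
    b = parse_count_objects(BEFORE)
    a = parse_count_objects(AFTER)
    print(f"packs: {b['packs']} -> {a['packs']}")
    print(f"size-pack reduction: {pack_reduction(BEFORE, AFTER):.1%}")
```

The same two functions could be fed output captured before and after a `git repack` or `gerrit gc` run to see whether the extra time is buying anything.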