the git-annex branch of a repository i've had running since 2010 has grown to unmanageable dimensions (5gb in a fresh clone of the git-annex branch, while the master branch is merely 40mb, part of which is due to checked-in files), so that git-annex merges take on the order of 15 minutes. getting an initial clone of the git-annex branch (not the data) takes hours in the "remote: Counting objects" phase alone (admittedly, the origin server is limited in ram, so it spends its time swapping the git process back and forth).

is there a recommended way to reset the git-annex branch in a coordinated way? of course, this would have to happen on all copies of the repo at the same time.

the workflow i currently imagine is

  • rename all copies of the repository (the_repo → the_repo-old, the_repo.git → the_repo-old.git)
  • clone the old origin repository to a new origin with --single-branch. (this would be the opportunity to run git filter-branch --prune-empty --index-filter 'git rm -r --cached --ignore-unmatch .git-annex' master as well, to get rid of commits of pre-whatever versions)
  • git annex init on the master repository
  • clone it to all the other copies and git annex init there
  • set all the configuration options (untrusted repos etc) again
  • either
    • git annex reinject the files that are already present on the respective machines, or
    • move the .git/annex/objects files over from the original locations, and use git annex fsck to make git-annex discover which files it already has, if that works. (i have numcopies=2, thus i'd dare to move instead of copy even when trying this out the first time. complete copies, even of partially checked out clones, will exceed the capacities of most clients)
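
the steps above could be sketched roughly like this. it is only a sketch under assumptions: the repository names are placeholders, run, new_origin and new_client are made-up helper names, --bare and --branch master are my guesses at how the new origin would be cloned, and the optional filter-branch step and the re-applying of configuration (untrusted repos etc) are left out. by default nothing is executed; each command is just printed.

```shell
#!/bin/sh
# sketch only: by default (DRY_RUN=1) commands are printed, not executed.
# set DRY_RUN=0 to really run them, after checking them against your setup.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# new_origin OLD_ORIGIN NEW_ORIGIN
# clone only master, so the bloated git-annex branch stays behind,
# then start a fresh git-annex branch in the new origin.
new_origin() {
    # --bare and --branch master are assumptions about the setup
    run git clone --bare --single-branch --branch master "$1" "$2"
    run git -C "$2" annex init "origin"
}

# new_client OLD_CLONE NEW_ORIGIN NEW_CLONE
# clone the new origin, move (not copy) the old object store over,
# and let fsck rediscover which files are present.
new_client() {
    run git clone "$2" "$3"
    run git -C "$3" annex init
    # assumes the fresh clone has no .git/annex/objects directory yet
    run mkdir -p "$3/.git/annex"
    run mv "$1/.git/annex/objects" "$3/.git/annex/objects"
    run git -C "$3" annex fsck
}
```

with the renamed repositories from the first step, this would be called as new_origin the_repo-old.git the_repo.git on the server and new_client the_repo-old the_repo.git the_repo on each client (the middle argument being whatever url the new origin is reachable under).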

my questions in that endeavor are:

  • is there already a standard workflow for this?
  • if not, will the above do the trick?
  • can anything be done to avoid such problems in future?

Yes, you can use fsck like that. I outlined a similar approach here, and I think you don't even need to make new git repositories, just delete the old branch and git gc it -- but I've not heard of anyone doing this yet.

So, since 2010, your repo must have gone through at least one and probably two repository format changes, which bloated the git branch. Hopefully we'll have no more of those. My largest repo that also went through that is under 150 mb however.

There was a recent bug fix where git annex copy unnecessarily updated the location log even when the file was already copied. That kind of thing can bloat the repository, especially if you had that in a cron job... You might find git annex log useful to look through the history of files and see if there have been a lot of location changes logged for whatever reason.
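
As a rough complement to git annex log, plain git can show which files on the git-annex branch were rewritten most often. This is only a sketch: rank_paths and annex_branch_hotspots are made-up helper names, and it assumes it is run inside a clone that still has the git-annex branch.

```shell
#!/bin/sh
# rank_paths [N]: read one path per line on stdin (blank lines ignored)
# and print the N most frequent ones as "COUNT PATH" (default N=10).
rank_paths() {
    sed '/^$/d' \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -n "${1:-10}" \
        | sed 's/^[[:space:]]*//'
}

# annex_branch_hotspots [N]: list the files changed by the most
# (non-merge) commits on the git-annex branch of the current repository.
annex_branch_hotspots() {
    git log --name-only --pretty=format: git-annex | rank_paths "${1:-10}"
}
```

Location log files that show up with very high counts would be the ones to look at more closely with git annex log; git rev-list --count git-annex gives the total number of commits on the branch for comparison.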

Comment by http://joeyh.name/ Wed Oct 31 16:03:55 2012

the procedure i outlined originally worked well for me; the method chosen for reinjection was moving over the .git/annex/objects directory and doing a git annex fsck.

special care had to be taken of the special remote (rsync+gpg) -- i guess that's why they are called special ;-) . as described in the forum post you linked, i had to copy over remote.log and the uuid.log line from the old git-annex branch -- otherwise, a git annex initremote would have generated a new hmac, effectively resetting the remote repo.
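
pulling the relevant lines out of the old branch could look roughly like this. it is a sketch under assumptions: lines_for_uuid and old_remote_lines are made-up helper names, it relies on each line of remote.log and uuid.log starting with the uuid of the repository it describes, and it only prints the lines; how they get committed onto the new git-annex branch is not covered here.

```shell
#!/bin/sh
# lines_for_uuid UUID: print the stdin lines whose first field is UUID.
lines_for_uuid() {
    awk -v u="$1" '$1 == u'
}

# old_remote_lines OLD_REPO UUID: show what the old git-annex branch
# recorded for one special remote, from both remote.log and uuid.log.
old_remote_lines() {
    git -C "$1" show git-annex:remote.log | lines_for_uuid "$2"
    git -C "$1" show git-annex:uuid.log | lines_for_uuid "$2"
}
```

for example, old_remote_lines the_repo-old "$uuid" with $uuid set to the uuid of the rsync remote would print that remote's line from each of the two files.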

the formerly 5gb git-annex branch (admittedly not git gc'd recently, but that just wasn't feasible any more) shrank to around 25mb of current location information. i'll keep an eye on how it's growing to see if the problem is inherent or if it was just old bugs causing trouble.

Comment by http://christian.amsuess.com/chrysn Sun Nov 4 12:23:56 2012