Don’t report OOM as a bug!
To do its job, reposurgeon needs to hold all of your history’s metadata in memory. That doesn’t mean the content part, but does mean all of the changeset attributions, comments, and tags. Given a large enough repository, this will overrun the RAM of a small machine. If this happens to you, your reposurgeon instance will die abruptly with an OOM (Out Of Memory) error while attempting to read in your repository.
It is extremely unlikely that this is due to a bug in reposurgeon. Before filing an issue about it, there is a procedure you should try. It consists of bisecting on the parameter the Go language runtime uses to control the frequency of garbage collection. You can set this using the environment variable GOGC or reposurgeon "gc" command.
GOGC defaults to 100, which instructs the runtime to garbage-collect when the heap size is 100% bigger than (i.e., twice) the size of the reachable objects after the last garbage collection. To increase the frequency of GC, usually resulting in a lower memory high-water mark, decrease that percentage threshold. To decrease gc frequency, increase the threshold so the runtime tolerates a larger heap.
To troubleshoot your OOM problem, bisect on this threshold to find the highest value that will avoid OOM. Start by cutting it to 50, then to 25, then to 12, then to 6. If you find a value that allows you to read to completion, you may want to try increasing it by a half interval (e,g. 50 to 75, 25 to 37, etc.) to get back some throughput.
If your repository won’t read in at GOGC=6 you have a real problem. Unfortunately, it’s not one the reposurgeon devteam can help you with; the correct solution to it is to do your conversion on a machine with more RAM. 64GB should be sufficient. The largest repository the reposurgeon devteam has ever seen (the history of GCC, 280K Subversion commits) fit on a 128GB machine with GOGC=30.
If you can’t read your history onto a 128GB machine with GOGC=30, then maybe the reposurgeon devteam ought to hear about it. That said, if you can find ways to make reposurgeon more efficient, we are eager to accept those patches, or even just a bug report with the details. It’s probable that there are some efficiency gains yet to be made.
How to report non-OOM bugs
reposurgeon is an extremely algorithmically complex program. It may still have bugs when dealing with strange corner cases in older repositories.
It is often extremely difficult or impossible to reproduce those bugs without a copy of the history on which they occurred. When you find a bug, please send me:
(a) An exact description of how observed behavior differed from expected behavior. If reposurgeon died, include the backtrace.
(b) A git fast-import or Subversion dump file of the repository you were operating on, or a pointer to where I can pull it from.
(c) A script containing the sequence of reposurgeon commands that revealed the bug.
Please use the reposurgeon project’s issue tracker and attach these files. It’s helpful.
Test case reduction
If you know how to reproduce the error, the best possible test case is a hand-crafted repository of minimal size with content that explains how it breaks the tool. I turn these into regression tests instantly.
When you don’t know the cause of the error, ship me a dump file derived from the real repository that triggered it. To speed up my debugging process so you can get an answer more quickly, there are some tactics you can use to reduce the bulk of the test case you send me. Also, a well-reduced dump can become a regression test to ensure the bug does not recur.
How to make dumps in Git: cd to your git repository and capture the output of "repotool export".
How to make dumps in Subversion: cd to the toplevel directory of the repository master - not a checkout of it, you can tell you’re in the right place if you see this:
$ ls conf db format hooks locks README.txt
Then run "repotool export", capturing the output.
The commands you will use for test-case reduction are reposurgeon and, on Subversion dumps, repocutter.
Replace the content blobs in the dump with stubs
The subcommand in both tools is 'strip'; it will usually cut the size of the dump by more than a factor of 10. Check that the bug still reproduces on the stripped dump; if it doesn’t, that would be unprecedented and interesting in itself.
If you are trying to maintain confidentiality about your code, sending me a stripped repo has the additional advantage that I won’t see the code! The command preserves structure and metadata but replaces each content blob with a unique magic cookie.
Truncate the dump as much as possible
Try to truncate the dump to the shortest leading section that reproduces the bug.
A reposurgeon error message will normally include a mark, event number, or (when reading a Subversion dump) a Subversion revision number. Use a selection-set argument to reposurgeon’s 'write' command, or the 'select' subcommand of repocutter, to pare down the dump so that it ends just after the error is triggered. Again, check to ensure that the bug reproduces on the truncated dump.
If the error message doesn’t tell you where the problem occurred, try a bisection process. Use the --readlimit option of the read to ignore the last half of the events in the dump; check to see if the bug reproduces. If it does, repeat; it it does not, throw out the last quarter, then the last eighth, and so forth. Keep this up until you can no longer drop events without making the bug go away.
Bisection is more effective than you might expect, because the kinds of repository malformations that trigger misbehavior from reposurgeon tend to rise in frequency as you go back in time. The largest single category of them has been ancient cruft produced by cvs2svn conversions.
Topologically reduce the dump
Next, topologically reduce the dump, throwing out boring commits that are unlikely to be related to your problem.
If a commit has all file modifications (no additions or deletions or copies or renames) and has exactly one ancestor and one descendant, then it may be boring. In a Subversion dump it also has to not have any property changes; in a git dump it must not be referred to by any tag or reset. Interesting commits are not boring, or have a not-boring parent or not-boring child.
Try using the 'reduce' subcommand of repocutter to strip boring commits out of a Subversion dump. For a git dump, look at "strip reduce".
Prune irrelevant branches
Try to throw away branches that are not relevant to your problem. The 'expunge' operation on repocutter or the 'branch delete' command in reposurgeon may be helpful.
This is the attempted simplification least unlikely to make your bug vanish, so check that carefully after each branch deletion.
Know how to spot possible importer bugs
If your target VCS’s importer dies during a rebuild, try writing the repository content to a stream instead and importing the stream by hand. If the latter does not fail, the target VCS’s importer may be slightly buggy - but you have a workaround.
(This has been observed under git 2.5.0 with the result of a unite operation on two repositories. The cause is unknown, as git dies suddenly enough to not leave a crash report.)
Benchmarking
A fair amount of effort has been expended to keep the run-time performance of Reposurgeon as linear as possible. This is not an easy state to stay in; it is unfortunately quite simple to accidentally regress this without noticing.
To that end, there are some fairly simple scripts in the bench
directory that can be used to check for this type of
problem. bench.sh
runs reposurgeon multiple times with a different
readlimit each time, recording the run time and memory allocated at
each iteration. Supply arguments specifying the dump file to read and
the readlimit values to use like this:
./bench/bench.sh your-dump-file.svn 1000 2000 20000
This reads your-dump-file.svn
10 times, with the readlimit set first
to 1,000, then 3,000, etc, stepping up until it reaches 20,000.
This produces a .dat
file which you can use with plot.sh
,
svg.sh
, or compare.sh
to produce graphs.
For an example, see bench/oops.svg
. This shows
a graph made using good revision that had linear performance, several
made with revisions that introduced a regression that made performance
quite non-linear, and the fix. You can easily tell the difference
visually.
Emergency help
If you need emergency help, go to the #reposurgeon IRC channel listed on the project web page. Be aware, however, that I am too busy to babysit difficult repository conversions unless I have explicitly volunteered for one or someone is paying me to care about it. For explanation, see Your money or your spec