What steps will reproduce the problem?

Take a large sub-directory in a repository (e.g. ccash) with some files within,

 $ tar -xzf ccash.tar.gz
 $ du -sh ccash
 59M    ccash
 $ ls -l ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar 
 -rw-r--r-- 1 dietz dietz   1748 Jul 27  2011 ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java
 -rw-r--r-- 1 dietz dietz 313898 May 22 18:36 ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar

Annex it,

 $ git annex add ccash
 ...
 $ ls -l ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar 
 lrwxrwxrwx 1 dietz dietz 215 Jul 27  2011 ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java -> ../../../../../../../../../../../.git/annex/objects/mv/zf/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486
 lrwxrwxrwx 1 dietz dietz 210 Jul 27  2011 ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar -> ../../../../../../../../.git/annex/objects/8G/gQ/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73

Unannex it (before or after committing),

 $ git annex unannex ccash

Note that some fraction of the files will still be symbolic links, now pointing to non-existent files. This data has apparently been lost forever.

 $ ls -l ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar 
 -rw-r--r-- 1 dietz dietz 1748 Jul 27  2011 ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java
 lrwxrwxrwx 1 dietz dietz  210 Jul 27  2011 ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar -> ../../../../../../../../.git/annex/objects/8G/gQ/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73

It is unclear why some files are affected while others are not. That being said, unannexing small numbers of files at a time appears to avoid the issue,

 $ tar -zxf ccash.tar.gz
 $ git annex add ccash
 $ ls -l ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar 
 lrwxrwxrwx 1 dietz dietz 215 Jul 27  2011 ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java -> ../../../../../../../../../../../.git/annex/objects/mv/zf/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486
 lrwxrwxrwx 1 dietz dietz 210 Jul 27  2011 ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar -> ../../../../../../../../.git/annex/objects/8G/gQ/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73
 $ git annex unannex ccash/trunk/DataProvider/WebContent/WEB-INF
 ...
 $ ls -l ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar 
 lrwxrwxrwx 1 dietz dietz    215 Jul 27  2011 ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java -> ../../../../../../../../../../../.git/annex/objects/mv/zf/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486
 -rw-r--r-- 1 dietz dietz 313898 Jul 27  2011 ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar

For this reason, it seems likely this is due to some sort of race condition.

What version of git-annex are you using? On what operating system?

This is on Ubuntu 12.04 with git-annex revision a1e2bc4.

Here is a quick script which reproduces the issue on another Ubuntu 12.04 machine,

 mkdir hi
 cd hi
 wget "http://downloads.sourceforge.net/project/free-cad/FreeCAD%20Source/freecad-0.11.3729.tar.gz"

 git init
 git annex init
 tar -zxf freecad-0.11.3729.tar.gz 
 git annex add FreeCAD-0.11.3729
 git annex unannex FreeCAD-0.11.3729
 echo "The following links are broken:"
 find -L . -type l

This results in dozens of dead symlinks.

What's going on here is you have multiple files with the same content, so the symlinks point to the same annexed file. When unannex processes the first symlink, it moves the annexed file to replace it. This breaks the other symlink that pointed to it. Notice that if you then re-add the file to the annex, the broken symlink automatically gets fixed -- there's no actual data loss going on here.
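
The mechanism described above can be reproduced without git-annex at all. The following is only a simulation under stated assumptions: plain `ln -s` and `mv` stand in for what unannex does, and the `objects/KEY` layout and file names are hypothetical, not the real annex layout.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# One stored object, and two "annexed" files with identical content
# that therefore point at the same object (hypothetical layout).
mkdir -p objects
printf 'same content\n' > objects/KEY
ln -s objects/KEY a.txt
ln -s objects/KEY b.txt

# Simulate unannexing a.txt: the object is moved to replace the symlink.
rm a.txt
mv objects/KEY a.txt

# b.txt still points at objects/KEY, which no longer exists.
if [ -L b.txt ] && [ ! -e b.txt ]; then
    b_state=dangling
else
    b_state=ok
fi
echo "b.txt is $b_state"

# Simulate re-adding a.txt: the content goes back under the same key,
# and b.txt resolves again without being touched.
mv a.txt objects/KEY
ln -s objects/KEY a.txt
cat b.txt
```

The dangling link is a broken reference, not missing content: the bytes are in `a.txt` the whole time.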

This problem can be avoided by using git annex unannex --fast, which makes hard links to the annexed file. But then you are also left with the hard links in .git/annex/objects; git annex unused can find and remove them.
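
As a rough illustration of why hard links avoid the broken symlinks, here is the same simulation with `ln` instead of `mv`. Again this is a sketch, not git-annex itself; the `objects/KEY` path is a hypothetical stand-in for the annexed object.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

mkdir -p objects
printf 'same content\n' > objects/KEY
ln -s objects/KEY a.txt
ln -s objects/KEY b.txt

# Simulate the --fast behavior: each symlink is replaced by a hard link
# to the object, and the object itself stays where it is.
for f in a.txt b.txt; do
    rm "$f"
    ln objects/KEY "$f"
done

# No dangling symlinks remain.
broken=$(find -L . -type l | wc -l)
echo "broken symlinks: $broken"

# The object is still present and now has three names in total.
links=$(ls -l objects/KEY | awk '{print $2}')
echo "link count of objects/KEY: $links"
```

The leftover name under `objects/` is the part that still has to be cleaned up afterwards.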

It may make sense to make the current "--fast" behavior the default for unannex.

Comment by http://joeyh.name/ Thu Sep 6 14:55:58 2012
Frankly, even the --fast behavior has an element of surprise to it. For example, one might have two files with identical content. Upon annexing and unannexing, they suddenly become hard links to the same file, correct? If this is the case, changes to one will result in changes to the other. I would consider this a very nasty sort of surprise.
Perhaps the solution is to make --fast the default and to make it copy files when the content in the annex already has a hard link to it.
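
One way to read this proposal, sketched as a simulation: hard link the first file, and copy for any later file once the object already has an extra link. The helper name `unannex_one` and the `objects/KEY` layout are hypothetical; this is not how git-annex is implemented.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

mkdir -p objects
printf 'same content\n' > objects/KEY
ln -s objects/KEY a.txt
ln -s objects/KEY b.txt

# Hypothetical helper: replace a symlink with a hard link to the object,
# unless the object already has another hard link, in which case copy.
unannex_one() {
    file=$1
    object=$2
    links=$(ls -l "$object" | awk '{print $2}')
    rm "$file"
    if [ "$links" -gt 1 ]; then
        cp "$object" "$file"
    else
        ln "$object" "$file"
    fi
}

unannex_one a.txt objects/KEY   # link count was 1: hard link
unannex_one b.txt objects/KEY   # link count is now 2: copy

# Editing b.txt no longer changes a.txt.
printf 'edited\n' >> b.txt
a_content=$(cat a.txt)
echo "a.txt: $a_content"
```

This only keeps the two unannexed files independent of each other; `a.txt` in this sketch still shares its content with the object.
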
Comment by http://joeyh.name/ Sun Sep 9 16:53:35 2012
That sounds far more reasonable.
Has any progress been made here? While this issue may not result in data loss, the behavior documented in this bug is certainly surprising and does not instill confidence in new users.

If unannex makes the file a hard link to the annexed content, it will be mode 444 or so. But if the user changes the permissions and modifies it, that will corrupt the content still in the annex!
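
The corruption scenario can be shown with ordinary files. This is a minimal sketch, assuming a hypothetical `objects/KEY` path in place of the real annexed object; `cksum` is used only to show that the stored content changed.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

mkdir -p objects
printf 'same content\n' > objects/KEY
chmod 444 objects/KEY

# The unannexed file is a hard link to the read-only object.
ln objects/KEY a.txt

before=$(cksum < objects/KEY)

# Permissions belong to the shared inode, so making a.txt writable
# makes the object writable too, and the edit lands in both names.
chmod u+w a.txt
printf 'edit\n' >> a.txt

after=$(cksum < objects/KEY)
echo "before: $before"
echo "after:  $after"
```

The object under `objects/` no longer matches the content it was stored under, even though it was never written to by that name.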

So the current --fast behavior seems no worse than the proposed behavior. And it's not at all clear to me that this would be a better default behavior for unannex than the current behavior, which at least ensures that data left in the annex (and referred to by another annexed file) cannot be corrupted.

Comment by http://joeyh.name/ Sun Sep 23 18:02:45 2012