Fix a pathological case in git detecting proper renames

Kumar Gala had a case in the u-boot archive with multiple renames of files with identical contents, and git would turn those into multiple "copy" operations of one of the sources, and just deleting the other sources. This patch makes the git exact rename detection prefer to spread out the renames over the multiple sources, rather than do multiple copies of one source. NOTE! The changes are a bit larger than required, because I also renamed the variables named "one" and "two" to "target" and "source" respectively. That makes the logic easier to follow, especially as the "one" was illogically the target and not the soruce, for purely historical reasons (this piece of code used to traverse over sources and targets in the wrong order, and when we fixed that, we didn't fix the names back then. So I fixed them now). The important part of this change is just the trivial score calculations for when files have identical contents: /* Give higher scores to sources that haven't been used already */ score = !source->rename_used; score += basename_same(source, target); and when we have multiple choices we'll now pick the choice that gets the best rename score, rather than only looking at whether the basename matched. It's worth noting a few gotchas: - this scoring is currently only done for the "exact match" case. In particular, in Kumar's example, even after this patch, the inexact match case is still done as a copy+delete rather than as two renames: delete mode 100644 board/cds/mpc8555cds/u-boot.lds copy board/{cds => freescale}/mpc8541cds/u-boot.lds (97%) rename board/{cds/mpc8541cds => freescale/mpc8555cds}/u-boot.lds (97%) because apparently the "cds/mpc8541cds/u-boot.lds" copy looked a bit more similar to both end results. That said, I *suspect* we just have the exact same issue there - the similarity analysis just gave identical (or at least very _close_ to identical) similarity points, and we do not have any logic to prefer multiple renames over a copy/delete there. That is a separate patch. - When you have identical contents and identical basenames, the actual entry that is chosen is still picked fairly "at random" for the first one (but the subsequent ones will prefer entries that haven't already been used). It's not actually really random, in that it actually depends on the relative alphabetical order of the files (which in turn will have impacted the order that the entries got hashed!), so it gives consistent results that can be explained. But I wanted to point it out as an issue for when anybody actually does cross-renames. In Kumar's case the choice is the right one (and for a single normal directory rename it should always be, since the relative alphabetical sorting of the files will be identical), and we now get: rename board/{cds => freescale}/mpc8541cds/init.S (100%) rename board/{cds => freescale}/mpc8548cds/init.S (100%) which is the "expected" answer. However, it might still be better to change the pedantic "exact same basename" on/off choice into a more graduated "how similar are the pathnames" scoring situation, in order to be more likely to get the exact rename choice that people *expect* to see, rather than other alternatives that may *technically* be equally good, but are surprising to a human. It's also unclear whether we should consider "basenames are equal" or "have already used this as a source" to be more important. This gives them equal weight, but I suspect we might want to just multiple the "basenames are equal" weight by two, or something, to prefer equal basenames even if that causes a copy/delete pair. I dunno. Anyway, what I'm just saying in a really long-winded manner is that I think this is right as-is, but it's not the complete solution, and it may want some further tweaking in the future. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
author: Linus Torvalds <torvalds@linux-foundation.org> 2007-11-29 21:30:13 (GMT)
committer: Junio C Hamano <gitster@pobox.com> 2007-11-30 23:49:17 (GMT)
commit: 32d75d29f99cca8e0874b1bdf94ded48b576c906 (patch)
tree: bda9cf164b0b0919059d0165ee28ee82a907fbe7 /diffcore-rename.c
parent: 28391a80a94d2b59d1d21f8264fe5dab91d77249 (diff)
download: git-32d75d29f99cca8e0874b1bdf94ded48b576c906.zip
git-32d75d29f99cca8e0874b1bdf94ded48b576c906.tar.gz
git-32d75d29f99cca8e0874b1bdf94ded48b576c906.tar.bz2
1 files changed, 16 insertions, 9 deletions
diff --git a/diffcore-rename.c b/diffcore-rename.c
index f9ebea5..f64294e 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -244,28 +244,35 @@ static int find_identical_files(struct file_similarity *src,
 	 * Walk over all the destinations ...
 	 */
 	do {
-		struct diff_filespec *one = dst->filespec;
+		struct diff_filespec *target = dst->filespec;
 		struct file_similarity *p, *best;
-		int i = 100;
+		int i = 100, best_score = -1;
 
 		/*
 		 * .. to find the best source match
 		 */
 		best = NULL;
 		for (p = src; p; p = p->next) {
-			struct diff_filespec *two = p->filespec;
+			int score;
+			struct diff_filespec *source = p->filespec;
 
 			/* False hash collission? */
-			if (hashcmp(one->sha1, two->sha1))
+			if (hashcmp(source->sha1, target->sha1))
 				continue;
 			/* Non-regular files? If so, the modes must match! */
-			if (!S_ISREG(one->mode) || !S_ISREG(two->mode)) {
-				if (one->mode != two->mode)
+			if (!S_ISREG(source->mode) || !S_ISREG(target->mode)) {
+				if (source->mode != target->mode)
 					continue;
 			}
-			best = p;
-			if (basename_same(one, two))
-				break;
+			/* Give higher scores to sources that haven't been used already */
+			score = !source->rename_used;
+			score += basename_same(source, target);
+			if (score > best_score) {
+				best = p;
+				best_score = score;
+				if (score == 2)
+					break;
+			}
 
 			/* Too many identical alternatives? Pick one */
 			if (!--i)
author	Linus Torvalds <torvalds@linux-foundation.org>	2007-11-29 21:30:13 (GMT)
committer	Junio C Hamano <gitster@pobox.com>	2007-11-30 23:49:17 (GMT)
commit	32d75d29f99cca8e0874b1bdf94ded48b576c906 (patch)
tree	bda9cf164b0b0919059d0165ee28ee82a907fbe7 /diffcore-rename.c
parent	28391a80a94d2b59d1d21f8264fe5dab91d77249 (diff)
download	git-32d75d29f99cca8e0874b1bdf94ded48b576c906.zip git-32d75d29f99cca8e0874b1bdf94ded48b576c906.tar.gz git-32d75d29f99cca8e0874b1bdf94ded48b576c906.tar.bz2