summaryrefslogtreecommitdiff
path: root/pack-bitmap.c
diff options
context:
space:
mode:
authorJeff King <peff@peff.net>2021-06-14 12:05:44 (GMT)
committerJunio C Hamano <gitster@pobox.com>2021-06-15 02:13:11 (GMT)
commitaa9ad6fee54898b9965f4fd26b3035fdd7b20f37 (patch)
tree907a8a7e9b64257c3c4466727e4011f5bd48c8ec /pack-bitmap.c
parentebf3c04b262aa27fbb97f8a0156c2347fecafafb (diff)
downloadgit-aa9ad6fee54898b9965f4fd26b3035fdd7b20f37.zip
git-aa9ad6fee54898b9965f4fd26b3035fdd7b20f37.tar.gz
git-aa9ad6fee54898b9965f4fd26b3035fdd7b20f37.tar.bz2
bitmaps: don't recurse into trees already in the bitmap
If an object is already mentioned in a reachability bitmap we are building, then by definition so are all of the objects it can reach. We have an optimization to stop traversing commits when we see they are already in the bitmap, but we don't do the same for trees. It's generally unavoidable to recurse into trees for commits not yet covered by bitmaps (since most commits generally do have unique top-level trees). But they usually have subtrees that are shared with other commits (i.e., all of the subtrees the commit _didn't_ touch). And some of those commits (and their trees) may be covered by the bitmap. Usually this isn't _too_ big a deal, because we'll visit those subtrees only once in total for the whole walk. But if you have a large number of unbitmapped commits, and if your tree is big, then you may end up opening a lot of sub-trees for no good reason. We can use the same optimization we do for commits here: when we are about to open a tree, see if it's in the bitmap (either the one we are building, or the "seen" bitmap which covers the UNINTERESTING side of the bitmap when doing a set-difference). This works especially well because we'll visit all commits before hitting any trees. So even in a history like: A -- B if "A" has a bitmap on disk but "B" doesn't, we'll already have OR-ed in the results from A before looking at B's tree (so we really will only look at trees touched by B). For most repositories, the timings produced by p5310 are unspectacular. Here's linux.git: Test HEAD^ HEAD -------------------------------------------------------------------- 5310.4: simulated clone 6.00(5.90+0.10) 5.98(5.90+0.08) -0.3% 5310.5: simulated fetch 2.98(5.45+0.18) 2.85(5.31+0.18) -4.4% 5310.7: rev-list (commits) 0.32(0.29+0.03) 0.33(0.30+0.03) +3.1% 5310.8: rev-list (objects) 1.48(1.44+0.03) 1.49(1.44+0.05) +0.7% Any improvement there is within the noise (the +3.1% on test 7 has to be noise, since we are not recursing into trees, and thus the new code isn't even run). The results for git.git are likewise uninteresting. But here are numbers from some other real-world repositories (that are not public). This one's tree is comparable in size to linux.git, but has ~16k refs (and so less complete bitmap coverage): Test HEAD^ HEAD ------------------------------------------------------------------------- 5310.4: simulated clone 38.34(39.86+0.74) 33.95(35.53+0.76) -11.5% 5310.5: simulated fetch 2.29(6.31+0.35) 2.20(5.97+0.41) -3.9% 5310.7: rev-list (commits) 0.99(0.86+0.13) 0.96(0.85+0.11) -3.0% 5310.8: rev-list (objects) 11.32(11.04+0.27) 6.59(6.37+0.21) -41.8% And here's another with a very large tree (~340k entries), and a fairly large number of refs (~10k): Test HEAD^ HEAD ------------------------------------------------------------------------- 5310.3: simulated clone 53.83(54.71+1.54) 39.77(40.76+1.50) -26.1% 5310.4: simulated fetch 19.91(20.11+0.56) 19.79(19.98+0.67) -0.6% 5310.6: rev-list (commits) 0.54(0.44+0.11) 0.51(0.43+0.07) -5.6% 5310.7: rev-list (objects) 24.32(23.59+0.73) 9.85(9.49+0.36) -59.5% This patch provides substantial improvements in these larger cases, and have any drawbacks for smaller ones (the cost of the bitmap check is quite small compared to an actual tree traversal). Note that we have to add a version of revision.c's include_check callback which handles non-commits. We could possibly consolidate this into a single callback for all objects types, as there's only one user of the feature which would need converted (pack-bitmap.c:should_include). That would in theory let us avoid duplicating any logic. But when I tried it, the code ended up much worse to read, with lots of repeated "if it's a commit do this, otherwise do that". Having two separate callbacks splits that naturally, and matches the existing split of show_commit/show_object callbacks. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Diffstat (limited to 'pack-bitmap.c')
-rw-r--r--pack-bitmap.c18
1 files changed, 18 insertions, 0 deletions
diff --git a/pack-bitmap.c b/pack-bitmap.c
index d90e1d9..bfc1014 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -525,6 +525,22 @@ static int should_include(struct commit *commit, void *_data)
return 1;
}
+static int should_include_obj(struct object *obj, void *_data)
+{
+ struct include_data *data = _data;
+ int bitmap_pos;
+
+ bitmap_pos = bitmap_position(data->bitmap_git, &obj->oid);
+ if (bitmap_pos < 0)
+ return 1;
+ if ((data->seen && bitmap_get(data->seen, bitmap_pos)) ||
+ bitmap_get(data->base, bitmap_pos)) {
+ obj->flags |= SEEN;
+ return 0;
+ }
+ return 1;
+}
+
static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
struct bitmap **base,
struct commit *commit)
@@ -620,6 +636,7 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
incdata.seen = seen;
revs->include_check = should_include;
+ revs->include_check_obj = should_include_obj;
revs->include_check_data = &incdata;
if (prepare_revision_walk(revs))
@@ -633,6 +650,7 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
&show_data, NULL);
revs->include_check = NULL;
+ revs->include_check_obj = NULL;
revs->include_check_data = NULL;
}