revision: implement sparse algorithm

When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with the commits we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees from the interesting commits to find the interesting objects that are placed in the pack.

This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path.

We had already taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING.

To actually improve the performance, we need to terminate our recursion. If the oidset contains only UNINTERESTING trees, then we do not continue the recursion. This avoids walking trees that are unlikely to be reachable from interesting trees. If the oidset contains only interesting trees, then we will walk these trees in the final stage that collects the interesting objects to place in the pack. Thus, we only recurse if the oidset contains both interesting and UNINTERESTING trees.

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. A test is modified to demonstrate this behavior and to verify that the new logic is being exercised.

Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

    v2.19.0
    ^v2.19.0~10

     Objects to pack: 550
    Walked (old alg): 282
    Walked (new alg): 130

For the Linux repo, I used the following input:

    v4.18
    ^v4.18~10

     Objects to pack: 518
    Walked (old alg): 4,836
    Walked (new alg): 188

The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo.

     Objects to pack: 220
    Walked (old alg): 22,804
    Walked (new alg): 129

I used the number of walked trees as the main metric above because it is consistent across multiple runs. When I ran my tests, the end-to-end time of the pack-objects command with the same options could vary by 10x depending on whether the file system was warm. However, by running the same test repeatedly I could get more consistent timing results.

The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
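
As a reading aid for the recursion described in the message, here is a minimal standalone sketch of the per-path grouping and the early termination. It is a toy model and not the code in revision.c: struct toy_tree, struct path_group, group_insert(), sparse_walk(), toy_example() and the fixed-size arrays are all hypothetical stand-ins for git's tree objects, hashmap and oidset, and blobs are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UNINTERESTING 1u
#define MAX_CHILDREN 4
#define MAX_PATHS 8
#define MAX_SET 8

/* Hypothetical stand-in for a git tree object; blobs are left out. */
struct toy_tree {
	const char *name;	/* path component under the parent tree */
	unsigned flags;
	int nr;			/* number of subtrees in child[] */
	struct toy_tree *child[MAX_CHILDREN];
};

/* One dictionary slot: a path and the set of trees seen at that path. */
struct path_group {
	const char *path;
	int nr;
	struct toy_tree *trees[MAX_SET];
};

static int toy_walked;	/* number of trees whose entries were inspected */

static void group_insert(struct path_group *groups, int *nr_groups,
			 struct toy_tree *t)
{
	int i, j;

	for (i = 0; i < *nr_groups; i++)
		if (!strcmp(groups[i].path, t->name))
			break;
	if (i == *nr_groups) {
		assert(i < MAX_PATHS);
		groups[i].path = t->name;
		groups[i].nr = 0;
		(*nr_groups)++;
	}
	for (j = 0; j < groups[i].nr; j++)
		if (groups[i].trees[j] == t)
			return;	/* set semantics: ignore duplicates */
	assert(groups[i].nr < MAX_SET);
	groups[i].trees[groups[i].nr++] = t;
}

static void sparse_walk(struct toy_tree **set, int nr)
{
	struct path_group groups[MAX_PATHS];
	int nr_groups = 0, has_interesting = 0, has_uninteresting = 0;
	int i, j;

	for (i = 0; i < nr; i++) {
		if (set[i]->flags & UNINTERESTING)
			has_uninteresting = 1;
		else
			has_interesting = 1;
	}

	/* Do not walk unless the set mixes both kinds of trees. */
	if (!has_interesting || !has_uninteresting)
		return;

	for (i = 0; i < nr; i++) {
		toy_walked++;
		for (j = 0; j < set[i]->nr; j++) {
			struct toy_tree *c = set[i]->child[j];

			/* Children of an UNINTERESTING tree inherit the flag. */
			if (set[i]->flags & UNINTERESTING)
				c->flags |= UNINTERESTING;
			group_insert(groups, &nr_groups, c);
		}
	}

	for (i = 0; i < nr_groups; i++)
		sparse_walk(groups[i].trees, groups[i].nr);
}

/* The old root holds a/ and b/; the new root changes a/ and reuses b/. */
static struct toy_tree shared_deep = { "deep", 0, 0, { NULL } };
static struct toy_tree shared_b = { "b", 0, 1, { &shared_deep } };
static struct toy_tree old_a = { "a", 0, 0, { NULL } };
static struct toy_tree new_a = { "a", 0, 0, { NULL } };
static struct toy_tree root_old = { "", UNINTERESTING, 2, { &old_a, &shared_b } };
static struct toy_tree root_new = { "", 0, 2, { &new_a, &shared_b } };

int toy_example(void)
{
	struct toy_tree *roots[] = { &root_old, &root_new };

	toy_walked = 0;
	sparse_walk(roots, 2);
	return toy_walked;
}
```

In this toy case toy_example() returns 4: both roots and both versions of a/ are inspected, while b/ is never opened because the only tree recorded at that path is already UNINTERESTING. The tree below b/ therefore keeps its flags untouched, which mirrors the early termination the message relies on.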
author: Derrick Stolee <dstolee@microsoft.com> 2019-01-16 18:25:59 (GMT)
committer: Junio C Hamano <gitster@pobox.com> 2019-01-17 21:44:42 (GMT)
commit: d5d2e93577e2b9f4a251f97116259346f0bead1e (patch)
tree: 856743cea4c9501323aa66c995c4600a08c9c670 /revision.c
parent: 4f6d26b16703e59e009fe5dde923b87793c5f561 (diff)
download: git-d5d2e93577e2b9f4a251f97116259346f0bead1e.zip
git-d5d2e93577e2b9f4a251f97116259346f0bead1e.tar.gz
git-d5d2e93577e2b9f4a251f97116259346f0bead1e.tar.bz2
1 files changed, 128 insertions, 10 deletions
diff --git a/revision.c b/revision.c
index 60421f3..5de7393 100644
--- a/revision.c
+++ b/revision.c
@@ -27,6 +27,7 @@
 #include "commit-reach.h"
 #include "commit-graph.h"
 #include "prio-queue.h"
+#include "hashmap.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -99,29 +100,146 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+struct path_and_oids_entry {
+	struct hashmap_entry ent;
+	char *path;
+	struct oidset trees;
+};
+
+static int path_and_oids_cmp(const void *hashmap_cmp_fn_data,
+			     const struct path_and_oids_entry *e1,
+			     const struct path_and_oids_entry *e2,
+			     const void *keydata)
+{
+	return strcmp(e1->path, e2->path);
+}
+
+static void paths_and_oids_init(struct hashmap *map)
+{
+	hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, NULL, 0);
+}
+
+static void paths_and_oids_clear(struct hashmap *map)
+{
+	struct hashmap_iter iter;
+	struct path_and_oids_entry *entry;
+	hashmap_iter_init(map, &iter);
+
+	while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) {
+		oidset_clear(&entry->trees);
+		free(entry->path);
+	}
+
+	hashmap_free(map, 1);
+}
+
+static void paths_and_oids_insert(struct hashmap *map,
+				  const char *path,
+				  const struct object_id *oid)
+{
+	int hash = strhash(path);
+	struct path_and_oids_entry key;
+	struct path_and_oids_entry *entry;
+
+	hashmap_entry_init(&key, hash);
+
+	/* use a shallow copy for the lookup */
+	key.path = (char *)path;
+	oidset_init(&key.trees, 0);
+
+	if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) {
+		entry = xcalloc(1, sizeof(struct path_and_oids_entry));
+		hashmap_entry_init(entry, hash);
+		entry->path = xstrdup(key.path);
+		oidset_init(&entry->trees, 16);
+		hashmap_put(map, entry);
+	}
+
+	oidset_insert(&entry->trees, oid);
+}
+
+static void add_children_by_path(struct repository *r,
+				 struct tree *tree,
+				 struct hashmap *map)
+{
+	struct tree_desc desc;
+	struct name_entry entry;
+
+	if (!tree)
+		return;
+
+	if (parse_tree_gently(tree, 1) < 0)
+		return;
+
+	init_tree_desc(&desc, tree->buffer, tree->size);
+	while (tree_entry(&desc, &entry)) {
+		switch (object_type(entry.mode)) {
+		case OBJ_TREE:
+			paths_and_oids_insert(map, entry.path, entry.oid);
+
+			if (tree->object.flags & UNINTERESTING) {
+				struct tree *child = lookup_tree(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		case OBJ_BLOB:
+			if (tree->object.flags & UNINTERESTING) {
+				struct blob *child = lookup_blob(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		default:
+			/* Subproject commit - not in this repository */
+			break;
+		}
+	}
+
+	free_tree_buffer(tree);
+}
+
 void mark_trees_uninteresting_sparse(struct repository *r,
 				     struct oidset *trees)
 {
+	unsigned has_interesting = 0, has_uninteresting = 0;
+	struct hashmap map;
+	struct hashmap_iter map_iter;
+	struct path_and_oids_entry *entry;
 	struct object_id *oid;
 	struct oidset_iter iter;
 
 	oidset_iter_init(trees, &iter);
-	while ((oid = oidset_iter_next(&iter))) {
+	while ((!has_interesting || !has_uninteresting) &&
+	       (oid = oidset_iter_next(&iter))) {
 		struct tree *tree = lookup_tree(r, oid);
 
 		if (!tree)
 			continue;
 
-		if (tree->object.flags & UNINTERESTING) {
-			/*
-			 * Remove the flag so the next call
-			 * is not a no-op. The flag is added
-			 * in mark_tree_unintersting().
-			 */
-			tree->object.flags ^= UNINTERESTING;
-			mark_tree_uninteresting(r, tree);
-		}
+		if (tree->object.flags & UNINTERESTING)
+			has_uninteresting = 1;
+		else
+			has_interesting = 1;
+	}
+
+	/* Do not walk unless we have both types of trees. */
+	if (!has_uninteresting || !has_interesting)
+		return;
+
+	paths_and_oids_init(&map);
+
+	oidset_iter_init(trees, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+		add_children_by_path(r, tree, &map);
 	}
+
+	hashmap_iter_init(&map, &map_iter);
+	while ((entry = hashmap_iter_next(&map_iter)))
+		mark_trees_uninteresting_sparse(r, &entry->trees);
+
+	paths_and_oids_clear(&map);
 }
 
 struct commit_stack {