summaryrefslogtreecommitdiff
path: root/commit-graph.c
AgeCommit message (Collapse)Author
2020-11-02Merge branch 'ds/commit-graph-merging-fix'Junio C Hamano
When "git commit-graph" detects the same commit recorded more than once while it is merging the layers, it used to die. The code now ignores all but one of them and continues. * ds/commit-graph-merging-fix: commit-graph: don't write commit-graph when disabled commit-graph: ignore duplicates when merging layers
2020-10-09commit-graph: don't write commit-graph when disabledDerrick Stolee
The core.commitGraph config setting can be set to 'false' to prevent parsing commits from the commit-graph file(s). This causes an issue when trying to write with "--split" which needs to distinguish between commits that are in the existing commit-graph layers and commits that are not. The existing mechanism uses parse_commit() and follows by checking if there is a 'graph_pos' that shows the commit was parsed from the commit-graph file. When core.commitGraph=false, we do not parse the commits from the commit-graph and 'graph_pos' indicates that no commits are in the existing file. The --split logic moves forward creating a new layer on top that holds all reachable commits, then possibly merges down into those layers, resulting in duplicate commits. The previous change makes that merging process more robust to such a situation in case it happens in the written commit-graph data. The easy answer here is to avoid writing a commit-graph if reading the commit-graph is disabled. Since the resulting commit-graph will would not be read by subsequent Git processes. This is more natural than forcing core.commitGraph to be true for the 'write' process. Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de> Helped-by: Jeff King <peff@peff.net> Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-09commit-graph: ignore duplicates when merging layersDerrick Stolee
Thomas reported [1] that a "git fetch" command was failing with an error saying "unexpected duplicate commit id". The root cause is that they had fetch.writeCommitGraph enabled which generates commit-graph chains, and this instance was merging two layers that both contained the same commit ID. [1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/ The initial assumption is that Git would not write a commit ID into a commit-graph layer if it already exists in a lower commit-graph layer. Somehow, this specific case did get into that situation, leading to this error. While unexpected, this isn't actually invalid (as long as the two layers agree on the metadata for the commit). When we parse a commit that does not have a graph_pos in the commit_graph_data_slab, we use binary search in the commit-graph layers to find the commit and set graph_pos. That position is never used again in this case. However, when we parse a commit from the commit-graph file, we load its parents from the commit-graph and assign graph_pos at that point. If those parents were already parsed from the commit-graph, then nothing needs to be done. Otherwise, this graph_pos is a valid position in the commit-graph so we can parse the parents, when necessary. Thus, this die() is too aggressive. The easiest thing to do would be to ignore the duplicates. If we only ignore the duplicates, then we will produce a commit-graph that has identical commit IDs listed in adjacent positions. This excess data will never be removed from the commit-graph, which could cascade into significantly bloated file sizes. Thankfully, we can collapse the list to erase the duplicate commit pointers. This allows us to get the end result we want without extra memory costs and minimal CPU time. The root cause is due to disabling core.commitGraph, which prevents parsing commits from the lower layers during a 'git commit-graph write --split' command. Since we use the 'graph_pos' value to determine whether a commit is in a lower layer, we never discover that those commits are already in the commit-graph chain and add them to the top layer. This layer is then merged down, creating duplicates. The test added in t5324-split-commit-graph.sh fails without this change. However, we still have not completely removed the need for this duplicate check. That will come in a follow-up change. Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de> Helped-by: Taylor Blau <me@ttaylorr.com> Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-29Merge branch 'tb/bloom-improvements'Junio C Hamano
"git commit-graph write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters option. * tb/bloom-improvements: commit-graph: introduce 'commitGraph.maxNewFilters' builtin/commit-graph.c: introduce '--max-new-filters=<n>' commit-graph: rename 'split_commit_graph_opts' bloom: encode out-of-bounds filters as non-empty bloom/diff: properly short-circuit on max_changes bloom: use provided 'struct bloom_filter_settings' bloom: split 'get_bloom_filter()' in two commit-graph.c: store maximum changed paths commit-graph: respect 'commitGraph.readChangedPaths' t/helper/test-read-graph.c: prepare repo settings commit-graph: pass a 'struct repository *' in more places t4216: use an '&&'-chain commit-graph: introduce 'get_bloom_filter_settings()'
2020-09-25Merge branch 'ds/maintenance-part-1'Junio C Hamano
A "git gc"'s big brother has been introduced to take care of more repository maintenance tasks, not limited to the object database cleaning. * ds/maintenance-part-1: maintenance: add trace2 regions for task execution maintenance: add auto condition for commit-graph task maintenance: use pointers to check --auto maintenance: create maintenance.<task>.enabled config maintenance: take a lock on the objects directory maintenance: add --task option maintenance: add commit-graph task maintenance: initialize task array maintenance: replace run_auto_gc() maintenance: add --quiet option maintenance: create basic maintenance runner
2020-09-18builtin/commit-graph.c: introduce '--max-new-filters=<n>'Taylor Blau
Introduce a command-line flag to specify the maximum number of new Bloom filters that a 'git commit-graph write' is willing to compute from scratch. Prior to this patch, a commit-graph write with '--changed-paths' would compute Bloom filters for all selected commits which haven't already been computed (i.e., by a previous commit-graph write with '--split' such that a roll-up or replacement is performed). This behavior can cause prohibitively-long commit-graph writes for a variety of reasons: * There may be lots of filters whose diffs take a long time to generate (for example, they have close to the maximum number of changes, diffing itself takes a long time, etc). * Old-style commit-graphs (which encode filters with too many entries as not having been computed at all) cause us to waste time recomputing filters that appear to have not been computed only to discover that they are too-large. This can make the upper-bound of the time it takes for 'git commit-graph write --changed-paths' to be rather unpredictable. To make this command behave more predictably, introduce '--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters from scratch. This lets "computing" already-known filters proceed quickly, while bounding the number of slow tasks that Git is willing to do. Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-18commit-graph: rename 'split_commit_graph_opts'Taylor Blau
In the subsequent commit, additional options will be added to the commit-graph API which have nothing to do with splitting. Rename the 'split_commit_graph_opts' structure to the more-generic 'commit_graph_opts' to encompass both. Likewise, rename the 'flags' member to instead be 'split_flags' to clarify that it only has to do with the behavior implied by '--split'. Suggested-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-18bloom: encode out-of-bounds filters as non-emptyTaylor Blau
When a changed-path Bloom filter has either zero, or more than a certain number (commonly 512) of entries, the commit-graph machinery encodes it as "missing". More specifically, it sets the indices adjacent in the BIDX chunk as equal to each other to indicate a "length 0" filter; that is, that the filter occupies zero bytes on disk. This has heretofore been fine, since the commit-graph machinery has no need to care about these filters with too few or too many changed paths. Both cases act like no filter has been generated at all, and so there is no need to store them. In a subsequent commit, however, the commit-graph machinery will learn to only compute Bloom filters for some commits in the current commit-graph layer. This is a change from the current implementation which computes Bloom filters for all commits that are in the layer being written. Critically for this patch, only computing some of the Bloom filters means adding a third state for length 0 Bloom filters: zero entries, too many entries, or "hasn't been computed". It will be important for that future patch to distinguish between "not representable" (i.e., zero or too-many changed paths), and "hasn't been computed". In particular, we don't want to waste time recomputing filters that have already been computed. To that end, change how we store Bloom filters in the "computed but not representable" category: - Bloom filters with no entries are stored as a single byte with all bits low (i.e., all queries to that Bloom filter will return "definitely not") - Bloom filters with too many entries are stored as a single byte with all bits set high (i.e., all queries to that Bloom filter will return "maybe"). These rules are sufficient to not incur a behavior change by changing the on-disk representation of these two classes. Likewise, no specification changes are necessary for the commit-graph format, either: - Filters that were previously empty will be recomputed and stored according to the new rules, and - old clients reading filters generated by new clients will interpret the filters correctly and be none the wiser to how they were generated. Clients will invoke the Bloom machinery in more cases than before, but this can be addressed by returning a NULL filter when all bits are set high. This can be addressed in a future patch. Note that this does increase the size of on-disk commit-graphs, but far less than other proposals. In particular, this is generally more efficient than storing a bitmap for which commits haven't computed their Bloom filters. Storing a bitmap incurs a penalty of one bit per commit, whereas storing explicit filters as above incurs a penalty of one byte per too-large or empty commit. In practice, these boundary commits likely occupy a small proportion of the overall number of commits, and so the size penalty is likely smaller than storing a bitmap for all commits. See, for example, these relative proportions of such boundary commits (collected by SZEDER Gábor): | Percentage of | commit-graph | | | commits modifying | file size | | ├────────┬──────────────┼───────────────────┤ pct. | | 0 path | >= 512 paths | before | after | change | ┌────────────────┼────────┼──────────────┼─────────┼─────────┼───────────┤ | android-base | 13.20% | 0.13% | 37.468M | 37.534M | +0.1741 % | | cmssw | 0.15% | 0.23% | 17.118M | 17.119M | +0.0091 % | | cpython | 3.07% | 0.01% | 7.967M | 7.971M | +0.0423 % | | elasticsearch | 0.70% | 1.00% | 8.833M | 8.835M | +0.0128 % | | gcc | 0.00% | 0.08% | 16.073M | 16.074M | +0.0030 % | | gecko-dev | 0.14% | 0.64% | 59.868M | 59.874M | +0.0105 % | | git | 0.11% | 0.02% | 3.895M | 3.895M | +0.0020 % | | glibc | 0.02% | 0.10% | 3.555M | 3.555M | +0.0021 % | | go | 0.00% | 0.07% | 3.186M | 3.186M | +0.0018 % | | homebrew-cask | 0.40% | 0.02% | 7.035M | 7.035M | +0.0065 % | | homebrew-core | 0.01% | 0.01% | 11.611M | 11.611M | +0.0002 % | | jdk | 0.26% | 5.64% | 5.537M | 5.540M | +0.0590 % | | linux | 0.01% | 0.51% | 63.735M | 63.740M | +0.0073 % | | llvm-project | 0.12% | 0.03% | 25.515M | 25.516M | +0.0050 % | | rails | 0.10% | 0.10% | 6.252M | 6.252M | +0.0027 % | | rust | 0.07% | 0.17% | 9.364M | 9.364M | +0.0033 % | | tensorflow | 0.09% | 1.02% | 7.009M | 7.010M | +0.0158 % | | webkit | 0.05% | 0.31% | 17.405M | 17.406M | +0.0047 % | (where the above increase is determined by computing a non-split commit-graph before and after this patch). Given that these projects are all "large" by commit count, the storage cost by writing these filters explicitly is negligible. In the most extreme example, android-base (which has 494,848 commits at the time of writing) would have its commit-graph increase by a modest 68.4 KB. Finally, a test to exercise filters which contain too many changed path entries will be introduced in a subsequent patch. Suggested-by: SZEDER Gábor <szeder.dev@gmail.com> Suggested-by: Jakub Narębski <jnareb@gmail.com> Helped-by: Derrick Stolee <dstolee@microsoft.com> Helped-by: SZEDER Gábor <szeder.dev@gmail.com> Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17maintenance: add commit-graph taskDerrick Stolee
The first new task in the 'git maintenance' builtin is the 'commit-graph' task. This updates the commit-graph file incrementally with the command git commit-graph write --reachable --split By writing an incremental commit-graph file using the "--split" option we minimize the disruption from this operation. The default behavior is to merge layers until the new "top" layer is less than half the size of the layer below. This provides quick writes most of the time, with the longer writes following a power law distribution. Most importantly, concurrent Git processes only look at the commit-graph-chain file for a very short amount of time, so they will verly likely not be holding a handle to the file when we try to replace it. (This only matters on Windows.) If a concurrent process reads the old commit-graph-chain file, but our job expires some of the .graph files before they can be read, then those processes will see a warning message (but not fail). This could be avoided by a future update to use the --expire-time argument when writing the commit-graph. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17bloom: use provided 'struct bloom_filter_settings'Taylor Blau
When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter from scratch, it looks to the default 'struct bloom_filter_settings' in order to determine the maximum number of changed paths, number of bits per entry, and so on. All of these values have so far been constant, and so there was no need to pass in a pointer from the caller (eg., the one that is stored in the 'struct write_commit_graph_context'). Start passing in a 'struct bloom_filter_settings *' instead of using the default values to respect graph-specific settings (eg., in the case of setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS'). In order to have an initialized value for these settings, move its initialization to earlier in the commit-graph write. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17bloom: split 'get_bloom_filter()' in twoTaylor Blau
'get_bloom_filter' takes a flag to control whether it will compute a Bloom filter if the requested one is missing. In the next patch, we'll add yet another parameter to this method, which would force all but one caller to specify an extra 'NULL' parameter at the end. Instead of doing this, split 'get_bloom_filter' into two functions: 'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only looks up a Bloom filter (and does not compute one if it's missing, thus dropping the 'compute_if_not_present' flag). The latter does compute missing Bloom filters, with an additional parameter to store whether or not it needed to do so. This simplifies many call-sites, since the majority of existing callers to 'get_bloom_filter' do not want missing Bloom filters to be computed (so they can drop the parameter entirely and use the simpler version of the function). While we're at it, instrument the new 'get_or_compute_bloom_filter()' with counters in the 'write_commit_graph_context' struct which store the number of filters that we did and didn't compute, as well as filters that were truncated. It would be nice to drop the 'compute_if_not_present' flag entirely, since all remaining callers of 'get_or_compute_bloom_filter' pass it as '1', but this will change in a future patch and hence cannot be removed. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17commit-graph.c: store maximum changed pathsTaylor Blau
For now, we assume that there is a fixed constant describing the maximum number of changed paths we are willing to store in a Bloom filter. Prepare for that to (at least partially) not be the case by making it a member of the 'struct bloom_filter_settings'. This will be helpful in the subsequent patches by reducing the size of test cases that exercise storing too many changed paths, as well as preparing for an eventual future in which this value might change. This patch alone does not cause newly generated Bloom filters to use a custom upper-bound on the maximum number of changed paths a single Bloom filter can hold, that will occur in a later patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09commit-graph: respect 'commitGraph.readChangedPaths'Taylor Blau
Git uses the 'core.commitGraph' configuration value to control whether or not the commit graph is used when parsing commits or performing a traversal. Now that commit-graphs can also contain a section for changed-path Bloom filters, administrators that already have commit-graphs may find it convenient to use those graphs without relying on their changed-path Bloom filters. This can happen, for example, during a staged roll-out, or in the event of an incident. Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom filters are read. Note that this configuration is independent from both: - 'core.commitGraph', to allow flexibility in using all parts of a commit-graph _except_ for its Bloom filters. - The '--changed-paths' option for 'git commit-graph write', to allow reading and writing Bloom filters to be controlled independently. When the variable is set, pretend as if no Bloom data was specified at all. This avoids adding additional special-casing outside of the commit-graph internals. Suggested-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09commit-graph: pass a 'struct repository *' in more placesTaylor Blau
In a future commit, some commit-graph internals will want access to 'r->settings', but we only have the 'struct object_directory *' corresponding to that repository. Add an additional parameter to pass the repository around in more places. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09commit-graph: introduce 'get_bloom_filter_settings()'Taylor Blau
Many places in the code often need a pointer to the commit-graph's 'struct bloom_filter_settings', in which case they often take the value from the top-most commit-graph. In the non-split case, this works as expected. In the split case, however, things get a little tricky. Not all layers in a chain of incremental commit-graphs are required to themselves have Bloom data, and so whether or not some part of the code uses Bloom filters depends entirely on whether or not the top-most level of the commit-graph chain has Bloom filters. This has been the behavior since Bloom filters were introduced, and has been codified into the tests since a759bfa9ee (t4216: add end to end tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130 requires that Bloom filters are not used in exactly the case described earlier. There is no reason that this needs to be the case, since it is perfectly valid for commits in an earlier layer to have Bloom filters when commits in a newer layer do not. Since Bloom settings are guaranteed in practice to be the same for any layer in a chain that has Bloom data, it is sufficient to traverse the '->base_graph' pointer until either (1) a non-null 'struct bloom_filter_settings *' is found, or (2) until we are at the root of the commit-graph chain. Introduce a 'get_bloom_filter_settings()' function that does just this, and use it instead of purely dereferencing the top-most graph's '->bloom_filter_settings' pointer. While we're at it, add an additional test in t5324 to guard against code in the commit-graph writing machinery that doesn't correctly handle a NULL 'struct bloom_filter *'. Co-authored-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-17commit-graph: use the "hash version" byteDerrick Stolee
The commit-graph format reserved a byte among the header of the file to store a "hash version". During the SHA-256 work, this was not modified because file formats are not necessarily intended to work across hash versions. If a repository has SHA-256 as its hash algorithm, it automatically up-shifts the lengths of object names in all necessary formats. However, since we have this byte available for adjusting the version, we can make the file formats more obviously incompatible instead of relying on other context from the repository. Update the oid_version() method in commit-graph.c to add a new value, 2, for sha-256. This automatically writes the new value in a SHA-256 repository _and_ verifies the value is correct. This is a breaking change relative to the current 'master' branch since 092b677 (Merge branch 'bc/sha-256-cvs-svn-updates', 2020-08-13) but it is not breaking relative to any released version of Git. The test impact is relatively minor: the output of 'test-tool read-graph' lists the header information, so those instances of '1' need to be replaced with a variable determined by GIT_TEST_DEFAULT_HASH. A more careful test is added that specifically creates a repository of each type then swaps the commit-graph files. The important value here is that the "git log" command succeeds while writing a message to stderr. Helped-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-30Merge branch 'ds/commit-graph-bloom-updates' into masterJunio C Hamano
Updates to the changed-paths bloom filter. * ds/commit-graph-bloom-updates: commit-graph: check all leading directories in changed path Bloom filters revision: empty pathspecs should not use Bloom filters revision.c: fix whitespace commit-graph: check chunk sizes after writing commit-graph: simplify chunk writes into loop commit-graph: unify the signatures of all write_graph_chunk_*() functions commit-graph: persist existence of changed-paths bloom: fix logic in get_bloom_filter() commit-graph: change test to die on parse, not load commit-graph: place bloom_settings in context
2020-07-30Merge branch 'sg/commit-graph-cleanups' into masterJunio C Hamano
The changed-path Bloom filter is improved using ideas from an independent implementation. * sg/commit-graph-cleanups: commit-graph: simplify write_commit_graph_file() #2 commit-graph: simplify write_commit_graph_file() #1 commit-graph: simplify parse_commit_graph() #2 commit-graph: simplify parse_commit_graph() #1 commit-graph: clean up #includes diff.h: drop diff_tree_oid() & friends' return value commit-slab: add a function to deep free entries on the slab commit-graph-format.txt: all multi-byte numbers are in network byte order commit-graph: fix parsing the Chunk Lookup table tree-walk.c: don't match submodule entries for 'submod/anything'
2020-07-15Merge branch 'sg/commit-graph-progress-fix' into masterJunio C Hamano
The code to produce progress output from "git commit-graph --write" had a few breakages, which have been fixed. * sg/commit-graph-progress-fix: commit-graph: fix "Writing out commit graph" progress counter commit-graph: fix progress of reachable commits
2020-07-09Merge branch 'tb/fix-persistent-shallow' into masterJunio C Hamano
When "fetch.writeCommitGraph" configuration is set in a shallow repository and a fetch moves the shallow boundary, we wrote out broken commit-graph files that do not match the reality, which has been corrected. * tb/fix-persistent-shallow: commit.c: don't persist substituted parents when unshallowing
2020-07-09commit-graph: fix "Writing out commit graph" progress counterSZEDER Gábor
76ffbca71a (commit-graph: write Bloom filters to commit graph file, 2020-04-06) added two delayed progress lines to writing the Bloom filter index and data chunk. This is wrong, because a single common progress is used while writing all chunks, which is not updated while writing these two new chunks, resulting in incomplete-looking "done" lines: Expanding reachable commits in commit graph: 888679, done. Computing commit changed paths Bloom filters: 100% (888678/888678), done. Writing out commit graph in 6 passes: 66% (3554712/5332068), done. Use the common 'struct progress' instance while writing the Bloom filter chunks as well. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-09commit-graph: fix progress of reachable commitsSZEDER Gábor
To display a progress line while iterating over all refs, d335ce8f24 (commit-graph.c: show progress of finding reachable commits, 2020-05-13) should have added a pair of start_delayed_progress() and stop_progress() calls around a for_each_ref() invocation. Alas, the stop_progress() call ended up at the wrong place, after write_commit_graph(), which does all the commit-graph computation and writing, and has several progress lines of its own. Consequently, that new Collecting referenced commits: 123 progress line is overwritten by the first progress line shown by write_commit_graph(), and its final "done" line is shown last, after everything is finished: Expanding reachable commits in commit graph: 344786, done. Computing commit changed paths Bloom filters: 100% (344786/344786), done. Collecting referenced commits: 154, done. Move that stop_progress() call to the right place. While at it, drop the unnecessary 'if (data.progress)' condition protecting the stop_progress() call, because that function is prepared to handle a NULL progress struct. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-08commit.c: don't persist substituted parents when unshallowingTaylor Blau
Since 37b9dcabfc (shallow.c: use '{commit,rollback}_shallow_file', 2020-04-22), Git knows how to reset stat-validity checks for the $GIT_DIR/shallow file, allowing it to change between a shallow and non-shallow state in the same process (e.g., in the case of 'git fetch --unshallow'). However, when $GIT_DIR/shallow changes, Git does not alter or remove any grafts (nor substituted parents) in memory. This comes up in a "git fetch --unshallow" with fetch.writeCommitGraph set to true. Ordinarily in a shallow repository (and before 37b9dcabfc, even in this case), commit_graph_compatible() would return false, indicating that the repository should not be used to write a commit-graphs (since commit-graph files cannot represent a shallow history). But since 37b9dcabfc, in an --unshallow operation that check succeeds. Thus even though the repository isn't shallow any longer (that is, we have all of the objects), the in-core representation of those objects still has munged parents at the shallow boundaries. When the commit-graph write proceeds, we use the incorrect parentage, producing wrong results. There are two ways for a user to work around this: either (1) set 'fetch.writeCommitGraph' to 'false', or (2) drop the commit-graph after unshallowing. One way to fix this would be to reset the parsed object pool entirely (flushing the cache and thus preventing subsequent reads from modifying their parents) after unshallowing. That would produce a problem when callers have a now-stale reference to the old pool, and so this patch implements a different approach. Instead, attach a new bit to the pool, 'substituted_parent', which indicates if the repository *ever* stored a commit which had its parents modified (i.e., the shallow boundary prior to unshallowing). This bit needs to be sticky because all reads subsequent to modifying a commit's parents are unreliable when unshallowing. Modify the check in 'commit_graph_compatible' to take this bit into account, and correctly avoid generating commit-graphs in this case, thus solving the bug. Helped-by: Derrick Stolee <dstolee@microsoft.com> Helped-by: Jonathan Nieder <jrnieder@gmail.com> Reported-by: Jay Conrod <jayconrod@google.com> Reviewed-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01commit-graph: check chunk sizes after writingSZEDER Gábor
In my experience while experimenting with new commit-graph chunks, early versions of the corresponding new write_commit_graph_my_chunk() functions are, sadly but not surprisingly, often buggy, and write more or less data than they are supposed to, especially if the chunk size is not directly proportional to the number of commits. This then causes all kinds of issues when reading such a bogus commit-graph file, raising the question of whether the writing or the reading part happens to be buggy this time. Let's catch such issues early, already when writing the commit-graph file, and check that each write_graph_chunk_*() function wrote the amount of data that it was expected to, and what has been encoded in the Chunk Lookup table. Now that all commit-graph chunks are written in a loop we can do this check in a single place for all chunks, and any chunks added in the future will get checked as well. Helped-by: René Scharfe <l.s.r@web.de> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01commit-graph: simplify chunk writes into loopSZEDER Gábor
In write_commit_graph_file() we now have one block of code filling the array of 'struct chunk_info' with the IDs and sizes of chunks to be written, and an other block of code calling the functions responsible for writing individual chunks. In case of optional chunks like Extra Edge List an Base Graphs List there is also a condition checking whether that chunk is necessary/desired, and that same condition is repeated in both blocks of code. Other, newer chunks have similar optional conditions. Eliminate these repeated conditions by storing the function pointers responsible for writing individual chunks in the 'struct chunk_info' array as well, and calling them in a loop to write the commit-graph file. This will open up the possibility for a bit of foolproofing in the following patch. Helped-by: René Scharfe <l.s.r@web.de> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01commit-graph: unify the signatures of all write_graph_chunk_*() functionsSZEDER Gábor
Update the write_graph_chunk_*() helper functions to have the same signature: - Return an int error code from all these functions. write_graph_chunk_base() already has an int error code, now the others will have one, too, but since they don't indicate any error, they will always return 0. - Drop the hash size parameter of write_graph_chunk_oids() and write_graph_chunk_data(); its value can be read directly from 'the_hash_algo' inside these functions as well. This opens up the possibility for further cleanups and foolproofing in the following two patches. Helped-by: René Scharfe <l.s.r@web.de> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01commit-graph: persist existence of changed-pathsDerrick Stolee
The changed-path Bloom filters were released in v2.27.0, but have a significant drawback. A user can opt-in to writing the changed-path filters using the "--changed-paths" option to "git commit-graph write" but the next write will drop the filters unless that option is specified. This becomes even more important when considering the interaction with gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of features.experimental). These config options trigger commit-graph writes that the user did not signal, and hence there is no --changed-paths option available. Allow a user that opts-in to the changed-path filters to persist the property of "my commit-graph has changed-path filters" automatically. A user can drop filters using the --no-changed-paths option. In the process, we need to be extremely careful to match the Bloom filter settings as specified by the commit-graph. This will allow future versions of Git to customize these settings, and the version with this change will persist those settings as commit-graphs are rewritten on top. Use the trace2 API to signal the settings used during the write, and check that output in a test after manually adjusting the correct bytes in the commit-graph file. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01bloom: fix logic in get_bloom_filter()Derrick Stolee
The get_bloom_filter() method is a bit complicated in some parts where it does not need to be. In particular, it needs to return a NULL filter only when compute_if_not_present is zero AND the filter data cannot be loaded from a commit-graph file. This currently happens by accident because the commit-graph does not load changed-path Bloom filters from an existing commit-graph when writing a new one. This will change in a later patch. Also clean up some style issues while we are here. One side-effect of returning a NULL filter is that the filters that are reported as "too large" will now be reported as NULL insead of length zero. This case was not properly covered before, so add a test. Further, remote the counting of the zero-length filters from revision.c and the trace2 logs. Helped-by: René Scharfe <l.s.r@web.de> Helped-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-24commit-graph: change test to die on parse, not loadDerrick Stolee
43d3561 (commit-graph write: don't die if the existing graph is corrupt, 2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment variable. This was created to verify that commit-graph was not loaded when writing a new non-incremental commit-graph. An upcoming change wants to load a commit-graph in some valuable cases, but we want to maintain that we don't trust the commit-graph data when writing our new file. Instead of dying on load, instead die if we ever try to parse a commit from the commit-graph. This functionally verifies the same intended behavior, but allows a more advanced feature in the next change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-24commit-graph: place bloom_settings in contextDerrick Stolee
Place an instance of struct bloom_settings into the struct write_commit_graph_context. This allows simplifying the function prototype of write_graph_chunk_bloom_data(). This will allow us to combine the function prototypes and use function pointers to simplify write_commit_graph_file(). By using a pointer, we can later replace the settings to match those that exist in the current commit-graph, in case a future Git version allows customization of these parameters. Reported-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-17commit-graph: minimize commit_graph_data_slab accessAbhishek Kumar
In an earlier patch, multiple struct acccesses to `graph_pos` and `generation` were auto-converted to multiple method calls. Since the values are fixed and commit-slab access costly, we would be better off with storing the values as a local variable and reusing it. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-17commit: move members graph_pos, generation to a slabAbhishek Kumar
We remove members `graph_pos` and `generation` from the struct commit. The default assignments in init_commit_node() are no longer valid, which is fine as the slab helpers return appropriate default values and the assignments are removed. We will replace existing use of commit->generation and commit->graph_pos by commit_graph_data_slab helpers using `contrib/coccinelle/commit.cocci'. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-17commit-graph: introduce commit_graph_data_slabAbhishek Kumar
The struct commit is used in many contexts. However, members `generation` and `graph_pos` are only used for commit-graph related operations and otherwise waste memory. This wastage would have been more pronounced as we transition to generation number v2, which uses 64-bit generation number instead of current 32-bits. As they are often accessed together, let's introduce struct commit_graph_data and move them to a commit_graph_data slab. While the overall test suite runs just as fast as master, (series: 26m48s, master: 27m34s, faster by 2.87%), certain commands like `git merge-base --is-ancestor` were slowed by 40% as discovered by Szeder Gábor [1]. After minimizing commit-slab access, the slow down persists but is closer to 20%. Derrick Stolee believes the slow down is attributable to the underlying algorithm rather than the slowness of commit-slab access [2] and we will follow-up in a later series. [1]: https://lore.kernel.org/git/20200607195347.GA8232@szeder.dev/ [2]: https://lore.kernel.org/git/13db757a-9412-7f1e-805c-8a028c4ab2b1@gmail.com/ Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-09Merge branch 'tb/commit-graph-no-check-oids'Junio C Hamano
Clean-up the commit-graph codepath. * tb/commit-graph-no-check-oids: commit-graph: drop COMMIT_GRAPH_WRITE_CHECK_OIDS flag t5318: reorder test below 'graph_read_expect' commit-graph.c: simplify 'fill_oids_from_commits' builtin/commit-graph.c: dereference tags in builtin builtin/commit-graph.c: extract 'read_one_commit()' commit-graph.c: peel refs in 'add_ref_to_set' commit-graph.c: show progress of finding reachable commits commit-graph.c: extract 'refs_cb_data'
2020-06-08commit-graph: simplify write_commit_graph_file() #2SZEDER Gábor
Unify the 'chunk_ids' and 'chunk_sizes' arrays into an array of 'struct chunk_info'. This will allow more cleanups in the following patches. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08commit-graph: simplify write_commit_graph_file() #1SZEDER Gábor
In write_commit_graph_file() one block of code fills the array of chunk IDs, another block of code fills the array of chunk offsets, then the chunk IDs and offsets are written to the Chunk Lookup table, and finally a third block of code writes the actual chunks. In case of optional chunks like Extra Edge List and Base Graphs List there is also a condition checking whether that chunk is necessary/desired, and that same condition is repeated in all those three blocks of code. This patch series is about to add more optional chunks, so there would be even more repeated conditions. Those chunk offsets are relative to the beginning of the file, so they inherently depend on the size of the Chunk Lookup table, which in turn depends on the number of chunks that are to be written to the commit-graph file. IOW at the time we set the first chunk's ID we can't yet know its offset, because we don't yet know how many chunks there are. Simplify this by initially filling an array of chunk sizes, not offsets, and calculate the offsets based on the chunk sizes only later, while we are writing the Chunk Lookup table. This way we can fill the arrays of chunk IDs and sizes in one go, eliminating one set of repeated conditions. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08commit-graph: simplify parse_commit_graph() #2SZEDER Gábor
The Chunk Lookup table stores the chunks' starting offset in the commit-graph file, not their sizes. Consequently, the size of a chunk can only be calculated by subtracting its offset from the offset of the subsequent chunk (or that of the terminating label). This is currenly implemented in a bit complicated way: as we iterate over the entries of the Chunk Lookup table, we check the id of each chunk and store its starting offset, then we check the id of the last seen chunk and calculate its size using its previously saved offset. At the moment there is only one chunk for which we calculate its size, but this patch series will add more, and the repeated chunk id checks are not that pretty. Instead let's read ahead the offset of the next chunk on each iteration, so we can calculate the size of each chunk right away, right where we store its starting offset. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08commit-graph: simplify parse_commit_graph() #1SZEDER Gábor
While we iterate over all entries of the Chunk Lookup table we make sure that we don't attempt to read past the end of the mmap-ed commit-graph file, and check in each iteration that the chunk ID and offset we are about to read is still within the mmap-ed memory region. However, these checks in each iteration are not really necessary, because the number of chunks in the commit-graph file is already known before this loop from the just parsed commit-graph header. So let's check that the commit-graph file is large enough for all entries in the Chunk Lookup table before we start iterating over those entries, and drop those per-iteration checks. While at it, take into account the size of everything that is necessary to have a valid commit-graph file, i.e. the size of the header, the size of the mandatory OID Fanout chunk, and the size of the signature in the trailer as well. Note that this necessitates the change of the error message as well, and, consequently, have to update the 'detect incorrect chunk count' test in 't5318-commit-graph.sh' as well. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08commit-graph: clean up #includesSZEDER Gábor
Our CodingGuidelines says that it's sufficient to include one of 'git-compat-util.h' and 'cache.h', but both 'commit-graph.c' and 'commit-graph.h' include both. Let's include only 'git-compat-util.h' to loose a bunch of unnecessary dependencies; but include 'hash.h', because 'commit-graph.h' does require the definition of 'struct object_id'. 'commit-graph.h' explicitly includes 'repository.h' and 'string-list.h', but only needs the declaration of a few structs from them. Drop these includes and forward-declare the necessary structs instead. 'commit-graph.c' includes 'dir.h', but doesn't actually use anything from there, so let's drop that #include as well. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08commit-graph: fix parsing the Chunk Lookup tableSZEDER Gábor
The commit-graph file format specifies that the chunks may be in any order. However, if the OID Lookup chunk happens to be the last one in the file, then any command attempting to access the commit-graph data will fail with: fatal: invalid commit position. commit-graph is likely corrupt In this case the error is wrong, the commit-graph file does conform to the specification, but the parsing of the Chunk Lookup table is a bit buggy, and leaves the field holding the number of commits in the commit-graph zero-initialized. The number of commits in the commit-graph is determined while parsing the Chunk Lookup table, by dividing the size of the OID Lookup chunk with the hash size. However, the Chunk Lookup table doesn't actually store the size of the chunks, but it stores their starting offset. Consequently, the size of a chunk can only be calculated by subtracting the starting offsets of that chunk from the offset of the subsequent chunk, or in case of the last chunk from the offset recorded in the terminating label. This is currenly implemented in a bit complicated way: as we iterate over the entries of the Chunk Lookup table, we check the ID of each chunk and store its starting offset, then we check the ID of the last seen chunk and calculate its size using its previously saved offset if necessary (at the moment it's only necessary for the OID Lookup chunk). Alas, while parsing the Chunk Lookup table we only interate through the "real" chunks, but never look at the terminating label, thus don't even check whether it's necessary to calulate the size of the last chunk. Consequently, if the OID Lookup chunk is the last one, then we don't calculate its size and turn don't run the piece of code determining the number of commits in the commit graph, leaving the field holding that number unchanged (i.e. zero-initialized), eventually triggering the sanity check in load_oid_from_graph(). Fix this by iterating through all entries in the Chunk Lookup table, including the terminating label. Note that this is the minimal fix, suitable for the maintenance track. A better fix would be to simplify how the chunk sizes are calculated, but that is a more invasive change, less suitable for 'maint', so that will be done in later patches. This additional flexibility of scanning more chunks breaks a test for "git commit-graph verify" so alter that test to mutate the commit-graph to have an even lower chunk count. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-18commit-graph: drop COMMIT_GRAPH_WRITE_CHECK_OIDS flagTaylor Blau
Since 7c5c9b9c57 (commit-graph: error out on invalid commit oids in 'write --stdin-commits', 2019-08-05), the commit-graph builtin dies on receiving non-commit OIDs as input to '--stdin-commits'. This behavior can be cumbersome to work around in, say, the case of piping 'git for-each-ref' to 'git commit-graph write --stdin-commits' if the caller does not want to cull out non-commits themselves. In this situation, it would be ideal if 'git commit-graph write' wrote the graph containing the inputs that did pertain to commits, and silently ignored the remainder of the input. Some options have been proposed to the effect of '--[no-]check-oids' which would allow callers to have the commit-graph builtin do just that. After some discussion, it is difficult to imagine a caller who wouldn't want to pass '--no-check-oids', suggesting that we should get rid of the behavior of complaining about non-commit inputs altogether. If callers do wish to retain this behavior, they can easily work around this change by doing the following: git for-each-ref --format='%(objectname) %(objecttype) %(*objecttype)' | awk ' !/commit/ { print "not-a-commit:"$1 } /commit/ { print $1 } ' | git commit-graph write --stdin-commits To make it so that valid OIDs that refer to non-existent objects are indeed an error after loosening the error handling, perform an extra lookup to make sure that object indeed exists before sending it to the commit-graph internals. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-18commit-graph.c: simplify 'fill_oids_from_commits'Taylor Blau
In the previous handful of commits, both 'git commit-graph write --reachable' and '--stdin-commits' learned to peel tags down to the commits which they refer to before passing them into the commit-graph internals. This makes the call to 'lookup_commit_reference_gently()' inside of 'fill_oids_from_commits()' a noop, since all OIDs are commits by that point. As such, remove the call entirely, as well as the progress meter, which has been split and moved out to the callers in the aforementioned earlier commits. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-13commit-graph.c: peel refs in 'add_ref_to_set'Taylor Blau
While iterating references (to discover the set of commits to write to the commit-graph with 'git commit-graph write --reachable'), 'add_ref_to_set' can save 'fill_oids_from_commits()' some time by peeling the references beforehand. Move peeling out of 'fill_oids_from_commits()' and into 'add_ref_to_set()' to use 'peel_ref()' instead of 'deref_tag()'. Doing so allows the commit-graph machinery to use the peeled value from '$GIT_DIR/packed-refs' instead of having to load and parse tags. While we're at it, discard non-commit objects reachable from ref tips. This would be done automatically by 'fill_oids_from_commits()', but such functionality will be removed in a subsequent patch after the call to 'lookup_commit_reference_gently' is dropped (at which point a non-commit object in the commits oidset will become an error). Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-13commit-graph.c: show progress of finding reachable commitsTaylor Blau
When 'git commit-graph write --reachable' is invoked, the commit-graph machinery calls 'for_each_ref()' to discover the set of reachable commits. Right now the 'add_ref_to_set' callback is not doing anything other than adding an OID to the set of known-reachable OIDs. In a subsequent commit, 'add_ref_to_set' will presumptively peel references. This operation should be fast for repositories with an up-to-date '$GIT_DIR/packed-refs', but may be slow in the general case. So that it doesn't appear that 'git commit-graph write' is idling with '--reachable' in the slow case, add a progress meter to provide some output in the meantime. In general, we don't expect a progress meter to appear at all, since peeling references with a 'packed-refs' file is quick. If it's slow and we do show a progress meter, the subsequent 'fill_oids_from_commits()' will be fast, since all of the calls to 'lookup_commit_reference_gently()' will be no-ops. Both progress meters are delayed, so it is unlikely that more than one will appear. In either case, this intermediate state will go away in a handful of patches, at which point there will be at most one progress meter. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-13Merge branch 'tb/shallow-cleanup'Junio C Hamano
Code cleanup. * tb/shallow-cleanup: shallow: use struct 'shallow_lock' for additional safety shallow.h: document '{commit,rollback}_shallow_file' shallow: extract a header file for shallow-related functions commit: make 'commit_graft_pos' non-static
2020-05-08Merge branch 'jt/commit-graph-plug-memleak'Junio C Hamano
Fix a leak noticed by fuzzer. * jt/commit-graph-plug-memleak: commit-graph: avoid memory leaks
2020-05-05Merge branch 'tb/commit-graph-perm-bits'Junio C Hamano
Some of the files commit-graph subsystem keeps on disk did not correctly honor the core.sharedRepository settings and some were left read-write. * tb/commit-graph-perm-bits: commit-graph.c: make 'commit-graph-chain's read-only commit-graph.c: ensure graph layers respect core.sharedRepository commit-graph.c: write non-split graphs as read-only lockfile.c: introduce 'hold_lock_file_for_update_mode' tempfile.c: introduce 'create_tempfile_mode'
2020-05-05commit-graph.c: extract 'refs_cb_data'Taylor Blau
In subsequent patches, we are going to update a progress meter when 'add_ref_to_set()' is called, and need a convenient way to pass a 'struct progress *' in from the caller. Introduce 'refs_cb_data' as a catch-all for parameters that 'add_ref_to_set' may need, and wrap the existing single parameter in that struct. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-04commit-graph: avoid memory leaksJonathan Tan
A fuzzer running on the entry point provided by fuzz-commit-graph.c revealed a memory leak when parse_commit_graph() creates a struct bloom_filter_settings and then returns early due to error. Fix that error by always freeing that struct first (if it exists) before returning early due to error. While making that change, I also noticed another possible memory leak - when the BLOOMDATA chunk is provided but not BLOOMINDEXES. Also fix that error. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-01Merge branch 'ds/blame-on-bloom'Junio C Hamano
"git blame" learns to take advantage of the "changed-paths" Bloom filter stored in the commit-graph file. * ds/blame-on-bloom: test-bloom: check that we have expected arguments test-bloom: fix some whitespace issues blame: drop unused parameter from maybe_changed_path blame: use changed-path Bloom filters tests: write commit-graph with Bloom filters revision: complicated pathspecs disable filters