path: root/read-cache.c
Age | Commit message | Author
2021-12-10 | Merge branch 'vd/sparse-reset' | Junio C Hamano
Various operating modes of "git reset" have been made to work better with the sparse index. * vd/sparse-reset: unpack-trees: improve performance of next_cache_entry reset: make --mixed sparse-aware reset: make sparse-aware (except --mixed) reset: integrate with sparse index reset: expand test coverage for sparse checkouts sparse-index: update command for expand/collapse test reset: preserve skip-worktree bit in mixed reset reset: rename is_missing to !is_in_reset_tree
2021-12-10 | Merge branch 'vd/sparse-sparsity-fix-on-read' | Junio C Hamano
Ensure that the sparseness of the in-core index matches the index.sparse configuration specified by the repository immediately after the on-disk index file is read. * vd/sparse-sparsity-fix-on-read: sparse-index: update do_read_index to ensure correct sparsity sparse-index: add ensure_correct_sparsity function sparse-index: avoid unnecessary cache tree clearing test-read-cache.c: prepare_repo_settings after config init
2021-11-29 | reset: make sparse-aware (except --mixed) | Victoria Dye
Remove `ensure_full_index` guard on `prime_cache_tree` and update `prime_cache_tree_rec` to correctly reconstruct sparse directory entries in the cache tree. While processing a tree's entries, `prime_cache_tree_rec` must determine whether a directory entry is sparse or not by searching for it in the index (*without* expanding the index). If a matching sparse directory index entry is found, no subtrees are added to the cache tree entry and the entry count is set to 1 (representing the sparse directory itself). Otherwise, the tree is assumed to not be sparse and its subtrees are recursively added to the cache tree. Helped-by: Elijah Newren <newren@gmail.com> Signed-off-by: Victoria Dye <vdye@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-25 | sparse-index: update do_read_index to ensure correct sparsity | Victoria Dye
Unless `command_requires_full_index` forces index expansion, ensure in-core index sparsity matches config settings on read by calling `ensure_correct_sparsity`. This makes the behavior of the in-core index more consistent between different methods of updating sparsity: manually changing the `index.sparse` config setting vs. executing `git sparse-checkout --[no-]sparse-index init` Although index sparsity is normally updated with `git sparse-checkout init`, ensuring correct sparsity after a manual `index.sparse` change has some practical benefits: 1. It allows for command-by-command sparsity toggling with `-c index.sparse=<true|false>`, e.g. when troubleshooting issues with the sparse index. 2. It prevents users from experiencing abnormal slowness after setting `index.sparse` to `true` due to use of a full index in all commands until the on-disk index is updated. Helped-by: Junio C Hamano <gitster@pobox.com> Co-authored-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Victoria Dye <vdye@github.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-25 | Merge branch 'rs/add-dry-run-without-objects' | Junio C Hamano
Stop "git add --dry-run" from creating new blob and tree objects. * rs/add-dry-run-without-objects: add: don't write objects with --dry-run
2021-10-18 | Merge branch 'rs/make-verify-path-really-verify-again' | Junio C Hamano
Recent sparse-index work broke safety against attempts to add paths with trailing slashes to the index, which has been corrected. * rs/make-verify-path-really-verify-again: read-cache: let verify_path() reject trailing dir separators again read-cache: add verify_path_internal() t3905: show failure to ignore sub-repo
2021-10-12 | add: don't write objects with --dry-run | René Scharfe
When the option --dry-run/-n is given, "git add" doesn't change the index, but still writes out new object files. Only hash the latter without writing instead to make the run as dry as possible. Use this opportunity to also make the hash_flags variable unsigned, to match the index_path() parameter it is used as. Reported-by: git.mexon@spamgourmet.com Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-11 | Merge branch 'sg/test-split-index-fix' | Junio C Hamano
Test updates. * sg/test-split-index-fix: read-cache: fix GIT_TEST_SPLIT_INDEX tests: disable GIT_TEST_SPLIT_INDEX for sparse index tests read-cache: look for shared index files next to the index, too t1600-index: disable GIT_TEST_SPLIT_INDEX t1600-index: don't run git commands upstream of a pipe t1600-index: remove unnecessary redirection
2021-10-08 | read-cache: let verify_path() reject trailing dir separators again | René Scharfe
6e773527b6 (sparse-index: convert from full to sparse, 2021-03-30) made verify_path() accept trailing directory separators for directories, which is necessary for sparse directory entries. This clemency causes "git stash" to stumble over sub-repositories, though, and there may be more unintended side-effects. Avoid them by restoring the old verify_path() behavior and accepting trailing directory separators only in places that are supposed to handle sparse directory entries. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-08 | read-cache: add verify_path_internal() | René Scharfe
Turn verify_path() into an internal function that distinguishes between valid paths and those with trailing directory separators and rename it to verify_path_internal(). Provide a wrapper with the old behavior under the old name. No functional change intended. The new function will be used in the next patch. Suggested-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
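Taken together with the follow-up patch listed just above (the log is newest-first), the end state is one internal checker with a three-way result, plus a strict wrapper and a lenient caller for sparse directory entries. The following is a minimal sketch of that shape only. The names `check_path_internal`, `check_path`, `check_path_allow_dir`, and the enum are hypothetical, and the checks are far simpler than those in read-cache.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; not the names used in read-cache.c. */
enum path_check {
	PATH_OK,
	PATH_DIR_WITH_SEP, /* valid, but ends in a directory separator */
	PATH_INVALID
};

/*
 * Simplified stand-in for the internal checker. It only looks at
 * empty paths, leading or doubled slashes, and a trailing '/'.
 */
static enum path_check check_path_internal(const char *path)
{
	size_t len;

	assert(path != NULL);
	len = strlen(path);
	if (len == 0 || path[0] == '/')
		return PATH_INVALID;
	if (strstr(path, "//"))
		return PATH_INVALID;
	if (path[len - 1] == '/')
		return PATH_DIR_WITH_SEP;
	return PATH_OK;
}

/* Strict wrapper: trailing separators are rejected. */
static int check_path(const char *path)
{
	return check_path_internal(path) == PATH_OK;
}

/* Lenient variant for callers that handle sparse directory entries. */
static int check_path_allow_dir(const char *path)
{
	return check_path_internal(path) != PATH_INVALID;
}
```

With this split, most callers keep the strict behavior by default, and only code that knowingly deals with sparse directory entries opts into accepting a trailing separator.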
2021-10-06 | Merge branch 'ab/repo-settings-cleanup' | Junio C Hamano
Code cleanup. * ab/repo-settings-cleanup: repository.h: don't use a mix of int and bitfields repo-settings.c: simplify the setup read-cache & fetch-negotiator: check "enum" values in switch() environment.c: remove test-specific "ignore_untracked..." variable wrapper.c: add x{un,}setenv(), and use xsetenv() in environment.c
2021-09-22 | repo-settings.c: simplify the setup | Ævar Arnfjörð Bjarmason
Simplify the setup code in repo-settings.c in various ways, making the code shorter, easier to read, and requiring fewer hacks to do the same thing as it did before:
Since 7211b9e7534 (repo-settings: consolidate some config settings, 2019-08-13) we have memset() the whole "settings" structure to -1 in prepare_repo_settings(), and subsequently relied on the -1 value. Most of the fields did not need to be initialized to -1, and because we were doing that we had the enum labels "UNTRACKED_CACHE_UNSET" and "FETCH_NEGOTIATION_UNSET" purely to reflect the resulting state created by this memset() in prepare_repo_settings(). No other code used or relied on them, more on that below.
For the rest, most of the subsequent "are we -1, then read xyz" checks can simply be removed by re-arranging what we read first. E.g. when setting the "index.version" setting we should have first read "feature.experimental", so that it (and "feature.manyfiles") can provide a default for our "index.version".
Instead the code setting it, added when "feature.manyFiles"[1] was created, was using the UPDATE_DEFAULT_BOOL() macro added in an earlier commit[2]. That macro is now gone, since it was only needed for this pattern of reading things in the wrong order.
This also fixes an (admittedly obscure) logic error where we'd conflate an explicit "-1" value in the config with our own earlier memset() -1.
We can also remove the UPDATE_DEFAULT_BOOL() wrapper added in [3]. Using it is redundant to simply using the return value from repo_config_get_bool(), which is non-zero if the provided key exists in the config.
Details on edge cases relating to the memset() to -1, continued from "more on that below" above:
 * UNTRACKED_CACHE_KEEP: In [4] the "unset" and "keep" handling for core.untrackedCache was consolidated. But while we understand the "keep" value, we don't handle it differently than the case of any other unknown value. So let's retain UNTRACKED_CACHE_KEEP and remove the UNTRACKED_CACHE_UNSET setting (which was always implicitly UNTRACKED_CACHE_KEEP before). We don't need to inform any code after prepare_repo_settings() that the setting was "unset"; as far as anyone else is concerned it's core.untrackedCache=keep if "core.untrackedcache" isn't present in the config.
 * FETCH_NEGOTIATION_UNSET & FETCH_NEGOTIATION_NONE: Since these two enum fields added in [5] don't rely on the memset() setting them to "-1" anymore we don't have to provide them with explicit values.
1. c6cc4c5afd2 (repo-settings: create feature.manyFiles setting, 2019-08-13)
2. 31b1de6a09b (commit-graph: turn on commit-graph by default, 2019-08-13)
3. 31b1de6a09b (commit-graph: turn on commit-graph by default, 2019-08-13)
4. ad0fb659993 (repo-settings: parse core.untrackedCache, 2019-08-13)
5. aaf633c2ad1 (repo-settings: create feature.experimental setting, 2019-08-13)
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-22 | read-cache & fetch-negotiator: check "enum" values in switch() | Ævar Arnfjörð Bjarmason
Change tweak_untracked_cache() in "read-cache.c" to use a switch() to have the compiler assert that we checked all possible values in the "enum untracked_cache_setting" type, and likewise remove the "default" case in fetch_negotiator_init() in favor of checking for "FETCH_NEGOTIATION_UNSET" and "FETCH_NEGOTIATION_NONE". As will be discussed in a subsequent commit, we'll only ever have either of these set to FETCH_NEGOTIATION_NONE, FETCH_NEGOTIATION_UNSET and UNTRACKED_CACHE_UNSET within the prepare_repo_settings() function itself. In preparation for fixing that code let's add a BUG() here to mark this as unreachable code. See ad0fb659993 (repo-settings: parse core.untrackedCache, 2019-08-13) for when the "unset" and "keep" handling for core.untrackedCache was consolidated, and aaf633c2ad1 (repo-settings: create feature.experimental setting, 2019-08-13) for the addition of the "default" pattern in "fetch-negotiator.c". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
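The pattern described here, a switch() over an enum with no "default" label so that the compiler can flag unhandled enumerators, can be sketched as below. The enum and function names are hypothetical and only loosely modeled on the untracked cache setting; abort() stands in for git's BUG(). The warning relies on -Wswitch, which GCC and Clang enable under -Wall.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical enum, loosely modeled on the untracked cache setting. */
enum cache_setting {
	CACHE_KEEP,
	CACHE_REMOVE,
	CACHE_WRITE
};

/*
 * No "default:" label: if an enumerator is added to cache_setting and
 * not handled here, -Wswitch reports it at compile time.
 */
static const char *describe_setting(enum cache_setting s)
{
	switch (s) {
	case CACHE_KEEP:
		return "keep";
	case CACHE_REMOVE:
		return "remove";
	case CACHE_WRITE:
		return "write";
	}
	/* Only reachable with an out-of-range value. */
	assert(!"unhandled enum cache_setting value");
	abort();
}
```

A "default" label would silence that warning, which is why the commit removes it and marks the fall-through path as unreachable instead.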
2021-09-08 | read-cache: fix GIT_TEST_SPLIT_INDEX | SZEDER Gábor
Running tests with GIT_TEST_SPLIT_INDEX=1 is supposed to turn on the split index feature and trigger index splitting (mostly) randomly. Alas, this has been broken since 6e37c8ed3c (read-cache.c: fix writing "link" index ext with null base oid, 2019-02-13), and GIT_TEST_SPLIT_INDEX=1 hasn't triggered any index splitting since then. This patch makes GIT_TEST_SPLIT_INDEX work again, though it doesn't restore the pre-6e37c8ed3c behavior. To understand the bug, the fix, and the behavior change we first have to look at how GIT_TEST_SPLIT_INDEX used to work before 6e37c8ed3c: There are two places where we check the value of GIT_TEST_SPLIT_INDEX, and before 6e37c8ed3c they worked like this: 1) In the lower-level do_write_index(), where, if GIT_TEST_SPLIT_INDEX is enabled, we call init_split_index(). This call merely allocates and zero-initializes 'istate->split_index', but does nothing else (i.e. doesn't fill the base/shared index with cache entries, doesn't actually write a shared index file, etc.). Pertinent to this issue, the hash of the base index remains all zeroed out. 2) In the higher-level write_locked_index(), but only when 'istate->split_index' has already been initialized. Then, if GIT_TEST_SPLIT_INDEX is enabled, it randomly sets the flag that triggers index splitting later in this function. This randomness comes from the first byte of the hash of the base index via an 'if ((first_byte & 15) < 6)' condition. However, if 'istate->split_index' hasn't been initialized (i.e. it is still NULL), then write_locked_index() just calls do_write_locked_index(), which internally calls the above mentioned do_write_index(). This means that while GIT_TEST_SPLIT_INDEX=1 usually triggered index splitting randomly, the first two index writes were always deterministic (though I suspect this was unintentional): - The initial index write never splits the index. 
During the first index write write_locked_index() is called with 'istate->split_index' still uninitialized, so the check in 2) is not executed. It still calls do_write_index(), though, which then executes the check in 1). The resulting all zero base index hash then leads to the 'link' extension being written to '.git/index', though a shared index file is not written:
    $ rm .git/index
    $ GIT_TEST_SPLIT_INDEX=1 git update-index --add file
    $ test-tool dump-split-index .git/index
    own c6ef71168597caec8553c83d9d0048f1ef416170
    base 0000000000000000000000000000000000000000
    100644 d00491fd7e5bb6fa28c517a0bb32b8b506539d4d 0 file
    replacements:
    deletions:
    $ ls -l .git/sharedindex.*
    ls: cannot access '.git/sharedindex.*': No such file or directory
- The second index write always splits the index. When the index written in the previous point is read, 'istate->split_index' is initialized because of the presence of the 'link' extension. So during the second write write_locked_index() does run the check in 2), and the first byte of the all zero base index hash always fulfills the randomness condition, which in turn always triggers the index splitting.
- Subsequent index writes will find the 'link' extension with a real non-zero base index hash, so from then on the check in 2) is executed and the first byte of the base index hash is as random as it gets (coming from the SHA-1 of index data including timestamps and inodes...).
All this worked until 6e37c8ed3c came along, and stopped writing the 'link' extension if the hash of the base index was all zero:
    $ rm .git/index
    $ GIT_TEST_SPLIT_INDEX=1 git update-index --add file
    $ test-tool dump-split-index .git/index
    own abbd6f6458d5dee73ae8e210ca15a68a390c6fd7
    not a split index
    $ ls -l .git/sharedindex.*
    ls: cannot access '.git/sharedindex.*': No such file or directory
So, since the first index write with GIT_TEST_SPLIT_INDEX=1 doesn't write a 'link' extension, in the second index write 'istate->split_index' remains uninitialized, and the check in 2) is not executed, and ultimately the index is never split.
Fix this by modifying write_locked_index() to make sure to check GIT_TEST_SPLIT_INDEX even if 'istate->split_index' is still uninitialized, and initialize it if necessary. The check for GIT_TEST_SPLIT_INDEX and separate init_split_index() call in do_write_index() thus becomes unnecessary, so remove it. Furthermore, add a test to 't1700-split-index.sh' to make sure that GIT_TEST_SPLIT_INDEX=1 will keep working (though only check the index splitting on the first index write, because after that it will be random).
Note that this change does not restore the pre-6e37c8ed3c behaviour, as it will deterministically split the index already on the first index write. Since GIT_TEST_SPLIT_INDEX is purely a developer aid, there is no backwards compatibility issue here. The new behaviour did trigger test failures in 't0003-attributes.sh' and 't1600-index.sh', though, which have been fixed in preparatory patches in this series.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Acked-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
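The randomness condition quoted in this message, 'if ((first_byte & 15) < 6)', is easy to check in isolation. The sketch below is not the code from read-cache.c. The helper names and the 20-byte hash length are assumptions made for illustration. It shows why an all-zero base hash always passes the condition, and that 96 of the 256 possible first bytes (6 out of every 16, or 37.5%) would trigger a split if the first byte is uniformly distributed.

```c
#include <assert.h>
#include <stddef.h>

#define HASH_LEN 20 /* SHA-1 size in bytes; an assumption for this sketch */

/* The condition quoted in the commit message, applied to a base hash. */
static int would_split(const unsigned char *base_hash)
{
	assert(base_hash != NULL);
	return (base_hash[0] & 15) < 6;
}

/* Count how many of the 256 possible first bytes trigger a split. */
static int count_splitting_first_bytes(void)
{
	unsigned char hash[HASH_LEN] = {0};
	int byte, hits = 0;

	for (byte = 0; byte < 256; byte++) {
		hash[0] = (unsigned char)byte;
		hits += would_split(hash);
	}
	return hits;
}
```

Because the low nibble of a zero byte is 0, the all-zero hash left behind by the first write always satisfies the condition, which matches the "second index write always splits the index" observation above.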
2021-09-08 | read-cache: look for shared index files next to the index, too | SZEDER Gábor
When reading a split index git always looks for its referenced shared base index in the gitdir of the current repository, even when reading an alternate index specified via GIT_INDEX_FILE, and even when that alternate index file is the "main" '.git/index' file of another repository. However, if that split index and its referenced shared index files were written by a git command running entirely in that other repository, then, naturally, the shared index file is written to that other repository's gitdir. Consequently, a git command attempting to read that shared index file while running in a different repository won't be able to find it and will error out.
I'm not sure in what use case it is necessary to read the index of one repository by a git command running in a different repository, but it is certainly possible to do so, and in fact the test 'bare repository: check that --cached honors index' in 't0003-attributes.sh' does exactly that. If GIT_TEST_SPLIT_INDEX=1 were to split the index at just the right moment [1], then this test would indeed fail, because the referenced shared index file could not be found.
Let's look for the referenced shared index file not only in the gitdir of the current repository, but, if the shared index is not there, right next to the split index as well.
[1] We haven't seen this issue trigger a failure in t0003 yet, because:
  - While GIT_TEST_SPLIT_INDEX=1 is supposed to trigger index splitting randomly, the first index write has always been deterministic and it has never split the index.
  - That alternate index file in the other repository is written only once in the entire test script, so it's never split.
However, the next patch will fix GIT_TEST_SPLIT_INDEX, and while doing so it will slightly change its behavior to always split the index already on the first index write, and t0003 would always fail without this patch.
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Acked-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
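The lookup order this commit describes (gitdir first, then the directory holding the split index) can be sketched as follows. This is an illustration under stated assumptions, not the code in read-cache.c. The function `find_shared_index`, the `exists_fn` callback, and the demo helper `exists_in_other_repo` are hypothetical, and the existence check is injected so that the sketch does not depend on real files.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Caller-supplied existence check, so the sketch needs no real files. */
typedef int (*exists_fn)(const char *path);

/*
 * Write the chosen shared index path into out. Returns 0 on success,
 * -1 if neither candidate exists or a path does not fit in out.
 */
static int find_shared_index(const char *gitdir, const char *index_path,
			     const char *base_hex, exists_fn exists,
			     char *out, size_t outlen)
{
	const char *slash;
	int n;

	assert(gitdir && index_path && base_hex && exists && out);

	/* First candidate: the current repository's gitdir. */
	n = snprintf(out, outlen, "%s/sharedindex.%s", gitdir, base_hex);
	if (n >= 0 && (size_t)n < outlen && exists(out))
		return 0;

	/* Fallback: the directory that holds the split index itself. */
	slash = strrchr(index_path, '/');
	if (!slash)
		n = snprintf(out, outlen, "sharedindex.%s", base_hex);
	else
		n = snprintf(out, outlen, "%.*s/sharedindex.%s",
			     (int)(slash - index_path), index_path, base_hex);
	if (n >= 0 && (size_t)n < outlen && exists(out))
		return 0;
	return -1;
}

/* Demo stand-in: pretend only the other repository has the file. */
static int exists_in_other_repo(const char *path)
{
	return strncmp(path, "other/.git/", 11) == 0;
}
```

Called with a gitdir of "here/.git" and an index path of "other/.git/index", the first candidate is missing, so the fallback next to the index is returned.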
2021-09-08 | sparse-index: add SPARSE_INDEX_MEMORY_ONLY flag | Derrick Stolee
The convert_to_sparse() method checks for the GIT_TEST_SPARSE_INDEX environment variable or the "index.sparse" config setting before converting the index to a sparse one. This is for ease of use since all current consumers are preparing to compress the index before writing it to disk. If these settings are not enabled, then convert_to_sparse() silently returns without doing anything. We will add a consumer in the next change that wants to use the sparse index as an in-memory data structure, regardless of whether the on-disk format should be sparse. To that end, create the SPARSE_INDEX_MEMORY_ONLY flag that will skip these config checks when enabled. All current consumers are modified to pass '0' in the new 'flags' parameter. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
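The effect of the new 'flags' parameter can be illustrated with a small sketch. Everything here is hypothetical: `SKETCH_MEMORY_ONLY` stands in for SPARSE_INDEX_MEMORY_ONLY, `struct sketch_index` reduces the index state to two booleans, and the return convention (1 if converted, 0 for a silent no-op) was chosen for the example and is not taken from sparse-index.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for SPARSE_INDEX_MEMORY_ONLY. */
#define SKETCH_MEMORY_ONLY (1 << 0)

struct sketch_index {
	int sparse;         /* 1 once converted */
	int config_enabled; /* stands in for index.sparse / GIT_TEST_SPARSE_INDEX */
};

/* Returns 1 if the index was converted, 0 if the call was a silent no-op. */
static int sketch_convert_to_sparse(struct sketch_index *istate, int flags)
{
	assert(istate != NULL);
	if (istate->sparse)
		return 0;
	/* On-disk callers pass 0 and remain subject to the config check. */
	if (!(flags & SKETCH_MEMORY_ONLY) && !istate->config_enabled)
		return 0;
	istate->sparse = 1;
	return 1;
}
```

Existing callers that pass 0 keep the old behavior, while a caller that only wants the in-memory structure can bypass the config check by passing the flag.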
2021-08-04 | Merge branch 'ah/plugleaks' | Junio C Hamano
Leak plugging. * ah/plugleaks: reset: clear_unpack_trees_porcelain to plug leak builtin/rebase: fix options.strategy memory lifecycle builtin/merge: free found_ref when done builtin/mv: free or UNLEAK multiple pointers at end of cmd_mv convert: release strbuf to avoid leak read-cache: call diff_setup_done to avoid leak ref-filter: also free head for ATOM_HEAD to avoid leak diffcore-rename: move old_dir/new_dir definition to plug leak builtin/for-each-repo: remove unnecessary argv copy to plug leak builtin/submodule--helper: release unused strbuf to avoid leak environment: move strbuf into block to plug leak fmt-merge-msg: free newly allocated temporary strings when done
2021-08-02 | Merge branch 'jt/bulk-prefetch' | Junio C Hamano
"git read-tree" had a codepath where blobs are fetched one-by-one from the promisor remote, which has been corrected to fetch in bulk. * jt/bulk-prefetch: cache-tree: prefetch in partial clone read-tree unpack-trees: refactor prefetching code
2021-07-28 | Merge branch 'ds/status-with-sparse-index' | Junio C Hamano
"git status" codepath learned to work with sparsely populated index without hydrating it fully. * ds/status-with-sparse-index: t1092: document bad sparse-checkout behavior fsmonitor: integrate with sparse index wt-status: expand added sparse directory entries status: use sparse-index throughout status: skip sparse-checkout percentage with sparse-index diff-lib: handle index diffs with sparse dirs dir.c: accept a directory as part of cone-mode patterns unpack-trees: unpack sparse directory entries unpack-trees: rename unpack_nondirectories() unpack-trees: compare sparse directories correctly unpack-trees: preserve cache_bottom t1092: add tests for status/add and sparse files t1092: expand repository data shape t1092: replace incorrect 'echo' with 'cat' sparse-index: include EXTENDED flag when expanding sparse-index: skip indexes with unmerged entries
2021-07-26 | read-cache: call diff_setup_done to avoid leak | Andrzej Hunt
repo_diff_setup() calls through to diff.c's static prep_parse_options(), which in turn allocates a new array into diff_opts.parseopts. diff_setup_done() is responsible for freeing that array, and has the benefit of verifying diff_opts too - hence we add a call to diff_setup_done() to avoid leaking parseopts.
Output from the leak as found while running t0090 with LSAN:
    Direct leak of 7120 byte(s) in 1 object(s) allocated from:
        #0 0x49a82d in malloc ../projects/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3
        #1 0xa8bf89 in do_xmalloc wrapper.c:41:8
        #2 0x7a7bae in prep_parse_options diff.c:5636:2
        #3 0x7a7bae in repo_diff_setup diff.c:4611:2
        #4 0x93716c in repo_index_has_changes read-cache.c:2518:3
        #5 0x872233 in unclean merge-ort-wrappers.c:12:14
        #6 0x872233 in merge_ort_recursive merge-ort-wrappers.c:53:6
        #7 0x5d5b11 in try_merge_strategy builtin/merge.c:752:12
        #8 0x5d0b6b in cmd_merge builtin/merge.c:1666:9
        #9 0x4ce83e in run_builtin git.c:475:11
        #10 0x4ccafe in handle_builtin git.c:729:3
        #11 0x4cb01c in run_argv git.c:818:4
        #12 0x4cb01c in cmd_main git.c:949:19
        #13 0x6bdc2d in main common-main.c:52:11
        #14 0x7f551eb51349 in __libc_start_main (/lib64/libc.so.6+0x24349)
    SUMMARY: AddressSanitizer: 7120 byte(s) leaked in 1 allocation(s)
Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-23 | unpack-trees: refactor prefetching code | Jonathan Tan
Refactor the prefetching code in unpack-trees.c into its own function, because it will be used elsewhere in a subsequent commit. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-17 | Merge branch 'ew/mmap-failures' | Junio C Hamano
Error message update. * ew/mmap-failures: xmmap: inform Linux users of tuning knobs on ENOMEM
2021-07-14 | status: use sparse-index throughout | Derrick Stolee
By testing 'git -c core.fsmonitor= status -uno', we can check for the simplest index operations that can be made sparse-aware. The necessary implementation details are already integrated with sparse-checkout, so modify command_requires_full_index to be zero for cmd_status().
In refresh_index(), we loop through the index entries to refresh their stat() information. However, sparse directories have no stat() information to populate. Ignore these entries.
This allows 'git status' to no longer expand a sparse index to a full one. This is further tested by dropping the "-uno" option and adding an untracked file into the worktree.
The performance test p2000-sparse-checkout-operations.sh demonstrates these improvements:
    Test                                  HEAD~1            HEAD
    -----------------------------------------------------------------------------
    2000.2: git status (full-index-v3)    0.31(0.30+0.05)   0.31(0.29+0.06) +0.0%
    2000.3: git status (full-index-v4)    0.31(0.29+0.07)   0.34(0.30+0.08) +9.7%
    2000.4: git status (sparse-index-v3)  2.35(2.28+0.10)   0.04(0.04+0.05) -98.3%
    2000.5: git status (sparse-index-v4)  2.35(2.24+0.15)   0.05(0.04+0.06) -97.9%
Note that since HEAD~1 was expanding the sparse index by parsing trees, it was artificially slower than the full index case. Thus, the 98% improvement is misleading, and instead we should celebrate the 0.34s to 0.05s improvement of 85%. This is more indicative of the performance gains we are expecting by using a sparse index.
Note: we are dropping the assignment of core.fsmonitor here. This is not necessary for the test script as we are not altering the config any other way. Correct integration with FS Monitor will be validated in later changes.
Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-08 | Merge branch 'ab/progress-cleanup' | Junio C Hamano
Code clean-up. * ab/progress-cleanup: read-cache.c: don't guard calls to progress.c API
2021-06-30 | xmmap: inform Linux users of tuning knobs on ENOMEM | Eric Wong
Linux users may benefit from additional information on how to avoid ENOMEM from mmap despite the system having enough RAM to accommodate them. We can't reliably unmap pack windows to work around the issue since malloc and other library routines may mmap without our knowledge. Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-08 | read-cache.c: don't guard calls to progress.c API | Ævar Arnfjörð Bjarmason
Don't guard the calls to the progress.c API with "if (progress)". The API itself will check this. This doesn't change any behavior, but makes this code consistent with the rest of the codebase. See ae9af12287b (status: show progress bar if refreshing the index takes too long, 2018-09-15) for the commit that added the pattern we're changing here. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
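The idea that "the API itself will check this" can be shown with a small sketch: the display function tolerates a NULL handle, so callers run the same code whether or not a progress meter exists. The struct and function names below are hypothetical stand-ins, not the progress.c API.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_progress {
	unsigned long long shown; /* last value displayed */
	int updates;              /* number of display calls */
};

/* The API tolerates NULL, so callers need no "if (progress)" guard. */
static void sketch_display_progress(struct sketch_progress *p,
				    unsigned long long n)
{
	if (!p)
		return;
	p->shown = n;
	p->updates++;
}

/* Caller: same code whether or not a progress meter was requested. */
static void refresh_entries(struct sketch_progress *p, int nr)
{
	int i;

	assert(nr >= 0);
	for (i = 0; i < nr; i++)
		sketch_display_progress(p, (unsigned long long)i + 1);
}
```

Keeping the NULL check in one place means a caller cannot forget it, and the call sites read the same as elsewhere in the codebase.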
2021-05-19 | read-cache: delete unused hashing methods | Derrick Stolee
These methods were marked as MAYBE_UNUSED in the previous change to avoid a complicated diff. Delete them entirely, since we now use the hashfile API instead of this custom hashing code. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19 | read-cache: use hashfile instead of git_hash_ctx | Derrick Stolee
The do_write_index() method in read-cache.c has its own hashing logic and buffering mechanism. Specifically, the ce_write() method was introduced by 4990aadc (Speed up index file writing by chunking it nicely, 2005-04-20) and similar mechanisms were introduced a few months later in c38138cd (git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26). Based on the timing, in the early days of the Git codebase, I figured that these roughly equivalent code paths were never unified only because it got lost in the shuffle. The hashfile API has since been used extensively in other file formats, such as pack-indexes, multi-pack-indexes, and commit-graphs. Therefore, it seems prudent to unify the index writing code to use the same mechanism. I discovered this disparity while trying to create a new index format that uses the chunk-format API. That API uses a hashfile as its base, so it is incompatible with the custom code in read-cache.c. This rewrite is rather straightforward. It replaces all writes to the temporary file with writes to the hashfile struct. This takes care of many of the direct interactions with the_hash_algo. There are still some git_hash_ctx uses remaining: the extension headers are hashed for use in the End of Index Entries (EOIE) extension. This use of the git_hash_ctx is left as-is. There are multiple reasons to not use a hashfile here, including the fact that the data is not actually writing to a file, just a hash computation. These hashes do not block our adoption of the chunk-format API in a future change to the index, so leave it as-is. The internals of the algorithms are mostly identical. Previously, the hashfile API used a smaller 8KB buffer instead of the 128KB buffer from read-cache.c. The previous change already unified these sizes. There is one subtle point: we do not pass the CSUM_FSYNC to the finalize_hashfile() method, which differs from most consumers of the hashfile API. 
The extra fsync() call indicated by this flag causes a significant performance degradation that is noticeable for quick commands that write the index, such as "git add". Other consumers can absorb this cost with their more complicated data structure organization, and further writing structures such as pack-files and commit-graphs is rarely in the critical path for common user interactions. Some static methods become orphaned in this diff, so I marked them as MAYBE_UNUSED. The diff is much harder to read if they are deleted during this change. Instead, they will be deleted in the following change. In addition to the test suite passing, I computed indexes using the previous binaries and the binaries compiled after this change, and found the index data to be exactly equal. Finally, I did extensive performance testing of "git update-index --force-write" on repos of various sizes, including one with over 2 million paths at HEAD. These tests demonstrated less than 1% difference in behavior. As expected, the performance should be considered unchanged. The previous changes to increase the hashfile buffer size from 8K to 128K ensured this change would not create a performance regression. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-16 | Merge branch 'mt/parallel-checkout-part-3' | Junio C Hamano
The final part of "parallel checkout". * mt/parallel-checkout-part-3: ci: run test round with parallel-checkout enabled parallel-checkout: add tests related to .gitattributes t0028: extract encoding helpers to lib-encoding.sh parallel-checkout: add tests related to path collisions parallel-checkout: add tests for basic operations checkout-index: add parallel checkout support builtin/checkout.c: complete parallel checkout support make_transient_cache_entry(): optionally alloc from mem_pool
2021-05-10 | Merge branch 'bc/hash-transition-interop-part-1' | Junio C Hamano
SHA-256 transition. * bc/hash-transition-interop-part-1: hex: print objects using the hash algorithm member hex: default to the_hash_algo on zero algorithm value builtin/pack-objects: avoid using struct object_id for pack hash commit-graph: don't store file hashes as struct object_id builtin/show-index: set the algorithm for object IDs hash: provide per-algorithm null OIDs hash: set, copy, and use algo field in struct object_id builtin/pack-redundant: avoid casting buffers to struct object_id Use the final_oid_fn to finalize hashing of object IDs hash: add a function to finalize object IDs http-push: set algorithm when reading object ID Always use oidread to read into struct object_id hash: add an algo member to struct object_id
2021-05-07 | Merge branch 'mt/add-rm-in-sparse-checkout' | Junio C Hamano
"git add" and "git rm" learned not to touch those paths that are outside of sparse checkout. * mt/add-rm-in-sparse-checkout: rm: honor sparse checkout patterns add: warn when asked to update SKIP_WORKTREE entries refresh_index(): add flag to ignore SKIP_WORKTREE entries pathspec: allow to ignore SKIP_WORKTREE entries on index matching add: make --chmod and --renormalize honor sparse checkouts t3705: add tests for `git add` in sparse checkouts add: include magic part of pathspec on --refresh error
2021-05-07 | Merge branch 'ad/cygwin-no-backslashes-in-paths' | Junio C Hamano
Cygwin pathname handling fix. * ad/cygwin-no-backslashes-in-paths: cygwin: disallow backslashes in file names
2021-05-05 | make_transient_cache_entry(): optionally alloc from mem_pool | Matheus Tavares
Allow make_transient_cache_entry() to optionally receive a mem_pool struct in which it should allocate the entry. This will be used in the following patch, to store some transient entries which should persist until parallel checkout finishes. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-30 | Merge branch 'ds/sparse-index-protections' | Junio C Hamano
Builds on top of the sparse-index infrastructure to mark operations that are not ready to work with the sparse index, causing them to fall back on the fully-populated index that they have always worked with. * ds/sparse-index-protections: (47 commits) name-hash: use expand_to_path() sparse-index: expand_to_path() name-hash: don't add directories to name_hash revision: ensure full index resolve-undo: ensure full index read-cache: ensure full index pathspec: ensure full index merge-recursive: ensure full index entry: ensure full index dir: ensure full index update-index: ensure full index stash: ensure full index rm: ensure full index merge-index: ensure full index ls-files: ensure full index grep: ensure full index fsck: ensure full index difftool: ensure full index commit: ensure full index checkout: ensure full index ...
2021-04-30 | cygwin: disallow backslashes in file names | Adam Dinwoodie
The backslash character is not a valid part of a file name on Windows. If, in Windows, Git attempts to write a file that has a backslash character in the filename, it will be incorrectly interpreted as a directory separator. This caused CVE-2019-1354 in MinGW, as this behaviour can be manipulated to cause the checkout to write to files it ought not write to, such as adding code to the .git/hooks directory. This was fixed by e1d911dd4c (mingw: disallow backslash characters in tree objects' file names, 2019-09-12). However, the vulnerability also exists in Cygwin: while Cygwin mostly provides a POSIX-like path system, it will still interpret a backslash as a directory separator. To avoid this vulnerability, CVE-2021-29468, extend the previous fix to also apply to Cygwin. Similarly, extend the test case added by the previous version of the commit. The test suite doesn't have an easy way to say "run this test if in MinGW or Cygwin", so add a new test prerequisite that covers both. As well as checking behaviour in the presence of paths containing backslashes, the existing test also checks behaviour in the presence of paths that differ only by the presence of a trailing ".". MinGW follows normal Windows application behaviour and treats them as the same path, but Cygwin more closely emulates *nix systems (at the expense of compatibility with native Windows applications) and will create and distinguish between such paths. Gate the relevant bit of that test accordingly. Reported-by: RyotaK <security@ryotak.me> Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Adam Dinwoodie <adam@dinwoodie.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
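The check described above can be sketched roughly as follows. This is a minimal illustration, not Git's actual path verification code: `demo_path_is_safe` is a hypothetical helper, and the real fix is gated on platform and configuration in ways this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not Git's API): returns false when the path
 * contains a backslash, which Windows and Cygwin would interpret as a
 * directory separator.
 */
static bool demo_path_is_safe(const char *path)
{
	assert(path != NULL);
	for (const char *p = path; *p; p++)
		if (*p == '\\')
			return false;
	return true;
}
```

With this sketch, a tree entry named `.git\hooks/pre-commit` would be rejected before it could be written into the hooks directory, while `dir/file.txt` would be accepted.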
2021-04-27 | Always use oidread to read into struct object_id | brian m. carlson
In the future, we'll want oidread to automatically set the hash algorithm member for an object ID we read into it, so ensure we use oidread instead of hashcpy everywhere we're copying a hash value into a struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
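A rough sketch of why a single entry point helps, under simplified assumptions: the `demo_` names, the 20-byte size, and the algorithm identifier are all illustrative stand-ins rather than Git's definitions.

```c
#include <assert.h>
#include <string.h>

#define DEMO_RAWSZ 20      /* raw hash size chosen for this sketch */
#define DEMO_ALGO_SHA1 1   /* hypothetical algorithm identifier */

struct demo_object_id {
	unsigned char hash[DEMO_RAWSZ];
	int algo;
};

/* Raw copy: moves only hash bytes, so it cannot record an algorithm. */
static void demo_hashcpy(unsigned char *dst, const unsigned char *src)
{
	memcpy(dst, src, DEMO_RAWSZ);
}

/* Single entry point for filling an object ID; it can also set algo. */
static void demo_oidread(struct demo_object_id *oid, const unsigned char *hash)
{
	assert(oid != NULL && hash != NULL);
	memcpy(oid->hash, hash, DEMO_RAWSZ);
	oid->algo = DEMO_ALGO_SHA1;
}
```

Because every caller goes through `demo_oidread`, adding the `algo` assignment later needs a change in one place only; callers that used the raw copy would leave the member unset.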
2021-04-14 | read-cache: ensure full index | Derrick Stolee
Before iterating over all cache entries, ensure that a sparse index is expanded to a full index to avoid unexpected behavior. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-14 | read-cache: expand on query into sparse-directory entry | Derrick Stolee
Callers to index_name_pos() or index_name_stage_pos() have a specific path in mind. If that happens to be a path with an ancestor being a sparse-directory entry, it can lead to unexpected results. In the case that we did not find the requested path, check to see if the position _before_ the inserted position is a sparse directory entry that matches the initial segment of the input path (including the directory separator at the end of the directory name). If so, then expand the index to be a full index and search again. This expansion will only happen once per index read. Future enhancements could be more careful to expand only the necessary sparse directory entry, but then we would have a special "not fully sparse, but also not fully expanded" mode that could affect writing the index to file. Since this only occurs if a specific file is requested outside of the sparse checkout definition, this is unlikely to be a common situation. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
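The lookup-then-check step described above might look like the following sketch. The types and function names are hypothetical simplifications (Git's cache entries and comparison rules are more involved); only the idea of inspecting the entry just before the insertion position is taken from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for a cache entry; only the fields used here. */
struct demo_entry {
	const char *name;      /* sparse directory names end in '/' */
	bool is_sparse_dir;
};

/*
 * Binary search over entries sorted by name. Returns the position when
 * found, or -(insert_position) - 1 when not found.
 */
static int demo_name_pos(const struct demo_entry *e, int nr, const char *path)
{
	int lo = 0, hi = nr;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(path, e[mid].name);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;
}

/*
 * True when a failed lookup landed directly after a sparse directory
 * whose name (including the trailing '/') is a prefix of the path.
 */
static bool demo_needs_expansion(const struct demo_entry *e, int nr,
				 const char *path)
{
	int pos = demo_name_pos(e, nr, path);
	int insert;
	size_t len;

	assert(path != NULL);
	if (pos >= 0)
		return false;   /* found: nothing to expand */
	insert = -pos - 1;
	if (insert == 0)
		return false;   /* no entry precedes the insert position */
	len = strlen(e[insert - 1].name);
	return e[insert - 1].is_sparse_dir && len > 0 &&
	       e[insert - 1].name[len - 1] == '/' &&
	       !strncmp(path, e[insert - 1].name, len);
}
```

In this sketch a caller would expand the index and repeat the search whenever `demo_needs_expansion` returns true.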
2021-04-14 | *: remove 'const' qualifier for struct index_state | Derrick Stolee
Several methods specify that they take a 'struct index_state' pointer with the 'const' qualifier because they intend to only query the data, not change it. However, we will be introducing a step very low in the method stack that might modify a sparse-index to become a full index in the case that our queries venture inside a sparse-directory entry. This change only removes the 'const' qualifiers that are necessary for the following change which will actually modify the implementation of index_name_stage_pos(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-08 | refresh_index(): add flag to ignore SKIP_WORKTREE entries | Matheus Tavares
refresh_index() doesn't update SKIP_WORKTREE entries, but it still matches them against the given pathspecs, marks the matches on the seen[] array, checks whether they are unmerged, etc. In the following patch, one caller will need refresh_index() to ignore SKIP_WORKTREE entries entirely, so add a flag that implements this behavior. While we are here, also realign the REFRESH_* flags and convert the hex values to the more natural bit shift format, which makes it easier to spot holes. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
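As an illustration of both points (bit-shift flag definitions and a flag that skips entries entirely), here is a small sketch. The `DEMO_` flag names, their bit positions, and the entry struct are assumptions made for the example, not the actual REFRESH_* definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flags; consecutive shifts make unused bits easy to spot. */
#define DEMO_REFRESH_QUIET                (1u << 0)
#define DEMO_REFRESH_IGNORE_MISSING       (1u << 1)
#define DEMO_REFRESH_IGNORE_SKIP_WORKTREE (1u << 2)

struct demo_refresh_entry {
	const char *name;
	bool skip_worktree;
};

/* Counts the entries that would be matched against the pathspecs. */
static size_t demo_count_matched(const struct demo_refresh_entry *e,
				 size_t nr, unsigned int flags)
{
	size_t matched = 0;

	assert(e != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++) {
		if ((flags & DEMO_REFRESH_IGNORE_SKIP_WORKTREE) &&
		    e[i].skip_worktree)
			continue;   /* ignored entirely: never matched */
		matched++;
	}
	return matched;
}
```

Without the flag every entry is still matched; with it, skip-worktree entries drop out before any pathspec matching happens.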
2021-03-30 | sparse-index: convert from full to sparse | Derrick Stolee
If we have a full index, then we can convert it to a sparse index by replacing directories outside of the sparse cone with sparse directory entries. The convert_to_sparse() method does this, when the situation is appropriate. For now, we avoid converting the index to a sparse index if: 1. the index is split. 2. the index is already sparse. 3. sparse-checkout is disabled. 4. sparse-checkout does not use cone mode. Finally, we currently limit the conversion to when the GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git config will be added in a later change. The trickiest thing about this conversion is that we might not be able to mark a directory as a sparse directory just because it is outside the sparse cone. There might be unmerged files within that directory, so we need to look for those. Also, if there is some strange reason why a file is not marked with CE_SKIP_WORKTREE, then we should give up on converting that directory. There is still hope that some of its subdirectories might be able to convert to sparse, so we keep looking deeper. The conversion process is assisted by the cache-tree extension. This is calculated from the full index if it does not already exist. We then abandon the cache-tree as it no longer applies to the newly-sparse index. Thus, this cache-tree will be recalculated in every sparse-full-sparse round-trip until we integrate the cache-tree extension with the sparse index. Some Git commands use the index after writing it. For example, 'git add' will update the index, then write it to disk, then read its entries to report information. To keep the in-memory index in a full state after writing, we re-expand it to a full one after the write. This is wasteful for commands that only write the index and do not read from it again, but that is only the case until we make those commands "sparse aware." 
We can compare the behavior of the sparse-index in t1092-sparse-checkout-compatibility.sh by using GIT_TEST_SPARSE_INDEX=1 when operating on the 'sparse-index' repo. We can also compare the two sparse repos directly, such as comparing their indexes (when expanded to full in the case of the 'sparse-index' repo). We also verify that the index is actually populated with sparse directory entries. The 'checkout and reset (mixed)' test is marked for failure when comparing a sparse repo to a full repo, but we can compare the two sparse-checkout cases directly to ensure that we are not changing the behavior when using a sparse index. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
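The conditions under which conversion is skipped can be summarized in a small guard function. This is a sketch only: the struct and its field names are hypothetical, and `test_env_enabled` merely stands in for the GIT_TEST_SPARSE_INDEX check mentioned in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the state the conversion depends on. */
struct demo_sparse_state {
	bool split_index;              /* 1. the index is split */
	bool already_sparse;           /* 2. the index is already sparse */
	bool sparse_checkout_enabled;  /* 3. sparse-checkout is enabled */
	bool cone_mode;                /* 4. sparse-checkout uses cone mode */
	bool test_env_enabled;         /* stands in for GIT_TEST_SPARSE_INDEX */
};

/* Returns true only when none of the listed exclusions applies. */
static bool demo_should_convert(const struct demo_sparse_state *s)
{
	assert(s != NULL);
	if (s->split_index || s->already_sparse)
		return false;
	if (!s->sparse_checkout_enabled || !s->cone_mode)
		return false;
	return s->test_env_enabled;
}
```

Keeping the exclusions in one early-return guard makes it easy to see which restriction would be lifted by a later change.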
2021-03-30 | sparse-index: add 'sdir' index extension | Derrick Stolee
The index format does not currently allow for sparse directory entries. Such entries violate expectations held by older versions of Git and by third-party tools, which might not understand them. We need an indicator inside the index file to warn these tools to not interact with a sparse index unless they are aware of sparse directory entries. Add a new _required_ index extension, 'sdir', that indicates that the index may contain sparse directory entries. This allows us to continue to use the differences in index formats 2, 3, and 4 before we create a new index version 5 in a later change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-30 | sparse-index: implement ensure_full_index() | Derrick Stolee
We will mark an in-memory index_state as having sparse directory entries with the sparse_index bit. These currently cannot exist, but we will add a mechanism for collapsing a full index to a sparse one in a later change. That will happen at write time, so we must first allow parsing the format before writing it. Commands or methods that require a full index in order to operate can call ensure_full_index() to expand that index in-memory. This requires parsing trees using that index's repository. Sparse directory entries have a specific 'ce_mode' value. The macro S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type. This ce_mode is not possible with the existing index formats, so we don't also verify all properties of a sparse-directory entry, which are: 1. ce->ce_mode == 0040000 2. ce->flags & CE_SKIP_WORKTREE is true 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator) 4. ce->oid references a tree object. These are all semi-enforced in ensure_full_index() to some extent. Any deviation will cause a warning at minimum or a failure in the worst case. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
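The four properties listed above can be expressed as a single predicate. This is a sketch with stand-in types: `demo_cache_entry`, the `DEMO_` constants, the flag's bit position, and the `oid_is_tree` field are assumptions, since checking the object type really requires an object lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_S_IFDIR          0040000u   /* mode value from the message */
#define DEMO_CE_SKIP_WORKTREE (1u << 0)  /* hypothetical bit position */

/* Simplified stand-in for a cache entry. */
struct demo_cache_entry {
	unsigned int mode;
	unsigned int flags;
	const char *name;
	size_t namelen;
	bool oid_is_tree;   /* stands in for an object-type lookup */
};

/* Checks the four sparse-directory properties in order. */
static bool demo_is_valid_sparse_dir(const struct demo_cache_entry *ce)
{
	assert(ce != NULL && ce->name != NULL);
	return ce->mode == DEMO_S_IFDIR &&
	       (ce->flags & DEMO_CE_SKIP_WORKTREE) &&
	       ce->namelen > 0 &&
	       ce->name[ce->namelen - 1] == '/' &&
	       ce->oid_is_tree;
}
```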
2021-03-19 | Merge branch 'rs/calloc-array' | Junio C Hamano
CALLOC_ARRAY() macro replaces many uses of xcalloc(). * rs/calloc-array: cocci: allow xcalloc(1, size) use CALLOC_ARRAY git-compat-util.h: drop trailing semicolon from macro definition
2021-03-19 | Merge branch 'js/fsmonitor-unpack-fix' | Junio C Hamano
The data structure used by fsmonitor interface was not properly duplicated during an in-core merge, leading to use-after-free etc. * js/fsmonitor-unpack-fix: fsmonitor: do not forget to release the token in `discard_index()` fsmonitor: fix memory corruption in some corner cases
2021-03-17 | fsmonitor: do not forget to release the token in `discard_index()` | Johannes Schindelin
In 56c6910028a (fsmonitor: change last update timestamp on the index_state to opaque token, 2020-01-07), we forgot to adjust `discard_index()` to release the "last-update" token: it is no longer a 64-bit number, but a free-form string that has been allocated. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
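The kind of leak being fixed can be sketched as follows. The struct, field, and function names are hypothetical; the point is only that a token stored as an allocated string must be freed when the index state is discarded, which was unnecessary while it was a plain 64-bit number.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the index state. */
struct demo_index_state {
	char *last_update_token;   /* heap-allocated, free-form string */
};

/* Stores a private copy of the token, replacing any previous one. */
static void demo_set_token(struct demo_index_state *istate, const char *token)
{
	size_t len = strlen(token) + 1;

	free(istate->last_update_token);
	istate->last_update_token = malloc(len);
	assert(istate->last_update_token != NULL);
	memcpy(istate->last_update_token, token, len);
}

/* Releases the token so that discarding the index does not leak it. */
static void demo_discard_index(struct demo_index_state *istate)
{
	free(istate->last_update_token);
	istate->last_update_token = NULL;
}
```

Resetting the pointer to NULL after freeing keeps a repeated discard harmless.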
2021-03-14 | use CALLOC_ARRAY | René Scharfe
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
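To show what "infers the element size automatically" means, here is a sketch of a macro in the same spirit. `DEMO_CALLOC_ARRAY` is an illustrative name, and it uses plain `calloc()`, whereas the message describes a replacement for xcalloc() calls, so the error handling differs.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Sketch only: allocates `alloc` zeroed elements and assigns them to x.
 * The element size comes from the pointer's own type via sizeof(*(x)).
 */
#define DEMO_CALLOC_ARRAY(x, alloc) ((x) = calloc((alloc), sizeof(*(x))))

struct demo_item {
	int id;
	double weight;
};

/* Returns a zero-initialized array of n items, or NULL on failure. */
static struct demo_item *demo_make_items(size_t n)
{
	struct demo_item *items;

	/* Open-coded form: items = calloc(n, sizeof(struct demo_item)); */
	DEMO_CALLOC_ARRAY(items, n);
	return items;
}
```

If the type of `items` changes later, the macro form stays correct, while the open-coded form would need its `sizeof` updated by hand.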
2021-03-01 | Merge branch 'ns/raise-write-index-buffer-size' | Junio C Hamano
Raise the buffer size used when writing the index file out from (obviously too small) 8kB to (clearly sufficiently large) 128kB. * ns/raise-write-index-buffer-size: read-cache: make the index write buffer size 128K
2021-03-01 | Merge branch 'jh/fsmonitor-prework' | Junio C Hamano
Preliminary changes to fsmonitor integration. * jh/fsmonitor-prework: fsmonitor: refactor initialization of fsmonitor_last_update token fsmonitor: allow all entries for a folder to be invalidated fsmonitor: log FSMN token when reading and writing the index fsmonitor: log invocation of FSMonitor hook to trace2 read-cache: log the number of scanned files to trace2 read-cache: log the number of lstat calls to trace2 preload-index: log the number of lstat calls to trace2 p7519: add trace logging during perf test p7519: move watchman cleanup earlier in the test p7519: fix watchman watch-list test on Windows p7519: do not rely on "xargs -d" in test
2021-02-24 | read-cache: make the index write buffer size 128K | Neeraj Singh
Writing an index 8K at a time invokes the OS filesystem and caching code very frequently, introducing noticeable overhead while writing large indexes. When experimenting with different write buffer sizes on Windows writing the Windows OS repo index (260MB), most of the benefit came from bumping the index write buffer size to 64K. I picked 128K to ensure that we're past the knee of the curve. With this change, the time under do_write_index for an index with 3M files goes from ~1.02s to ~0.72s. Signed-off-by: Neeraj Singh <neerajsi@ntdev.microsoft.com> Acked-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
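The effect of a larger buffer on the number of write calls can be sketched with a simple buffered writer. Everything here is illustrative: the `demo_` names are hypothetical, and the flush only counts calls instead of writing to a file descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_WRITE_BUFFER_SIZE (128 * 1024)

struct demo_writer {
	unsigned char buf[DEMO_WRITE_BUFFER_SIZE];
	size_t used;            /* bytes currently buffered */
	size_t flushes;         /* stands in for the number of write calls */
	size_t flushed_bytes;   /* total bytes handed to the "OS" */
};

/* Empties the buffer; a real writer would call write(2) here. */
static void demo_flush(struct demo_writer *w)
{
	if (!w->used)
		return;
	w->flushes++;
	w->flushed_bytes += w->used;
	w->used = 0;
}

/* Appends data, flushing each time the buffer becomes full. */
static void demo_write(struct demo_writer *w, const void *data, size_t len)
{
	const unsigned char *p = data;

	assert(w != NULL && (data != NULL || len == 0));
	while (len) {
		size_t room = sizeof(w->buf) - w->used;
		size_t n = len < room ? len : room;

		memcpy(w->buf + w->used, p, n);
		w->used += n;
		p += n;
		len -= n;
		if (w->used == sizeof(w->buf))
			demo_flush(w);
	}
}
```

In this sketch, feeding forty 8K chunks through the writer produces three flushes; an 8K buffer would need forty.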