summaryrefslogtreecommitdiff
path: root/dir.c
AgeCommit message (Collapse)Author
2020-11-21Merge branch 'en/strmap'Junio C Hamano
A specialization of hashmap that uses a string as key has been introduced. Hopefully it will see wider use over time. * en/strmap: shortlog: use strset from strmap.h Use new HASHMAP_INIT macro to simplify hashmap initialization strmap: take advantage of FLEXPTR_ALLOC_STR when relevant strmap: enable allocations to come from a mem_pool strmap: add a strset sub-type strmap: split create_entry() out of strmap_put() strmap: add functions facilitating use as a string->int map strmap: enable faster clearing and reusing of strmaps strmap: add more utility functions strmap: new utility functions hashmap: provide deallocation function names hashmap: introduce a new hashmap_partial_clear() hashmap: allow re-use after hashmap_free() hashmap: adjust spacing to fix argument alignment hashmap: add usage documentation explaining hashmap_free[_entries]()
2020-11-02Merge branch 'nk/dir-c-comment-update'Junio C Hamano
Update stale in-code comment. * nk/dir-c-comment-update: dir.c: fix comments to agree with argument name
2020-11-02hashmap: provide deallocation function namesElijah Newren
hashmap_free(), hashmap_free_entries(), and hashmap_free_() have existed for a while, but aren't necessarily the clearest names, especially with hashmap_partial_clear() being added to the mix and lazy-initialization now being supported. Peff suggested we adopt the following names[1]: - hashmap_clear() - remove all entries and de-allocate any hashmap-specific data, but be ready for reuse - hashmap_clear_and_free() - ditto, but free the entries themselves - hashmap_partial_clear() - remove all entries but don't deallocate table - hashmap_partial_clear_and_free() - ditto, but free the entries This patch provides the new names and converts all existing callers over to the new naming scheme. [1] https://lore.kernel.org/git/20201030125059.GA3277724@coredump.intra.peff.net/ Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-16dir.c: fix comments to agree with argument nameAlex Vandiver
Signed-off-by: Alex Vandiver <alexmv@dropbox.com> Signed-off-by: Nipunn Koorapati <nipunn@dropbox.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-30dir.c: drop unused "untracked" from treat_path_fast()Jeff King
We don't use the untracked_cache_dir parameter that is passed in, but instead look at the untracked_cache_dir inside the cached_dir struct we are passed. It's been this way since the introduction of treat_path_fast() in 91a2288b5f (untracked cache: record/validate dir mtime and reuse cached output, 2015-03-08). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-27Merge branch 'jk/leakfix'Junio C Hamano
Code clean-up. * jk/leakfix: submodule--helper: fix leak of core.worktree value config: fix leak in git_config_get_expiry_in_days() config: drop git_config_get_string_const() config: fix leaks from git_config_get_string_const() checkout: fix leak of non-existent branch names submodule--helper: use strbuf_release() to free strbufs clear_pattern_list(): clear embedded hashmaps
2020-08-24Merge branch 'en/dir-clear'Junio C Hamano
Leakfix with code clean-up. * en/dir-clear: dir: fix problematic API to avoid memory leaks dir: make clear_directory() free all relevant memory
2020-08-24Merge branch 'en/dir-nonbare-embedded'Junio C Hamano
"ls-files -o" mishandled the top-level directory of another git working tree that hangs in the current git working tree. * en/dir-nonbare-embedded: dir: avoid prematurely marking nonbare repositories as matches t3000: fix some test description typos
2020-08-19dir: fix problematic API to avoid memory leaksElijah Newren
The dir structure seemed to have a number of leaks and problems around it. First I noticed that parent_hashmap and recursive_hashmap were being leaked (though Peff noticed and submitted fixes before me). Then I noticed in the previous commit that clear_directory() was only taking responsibility for a subset of fields within dir_struct, despite the fact that entries[] and ignored[] we allocated internally to dir.c. That, of course, resulted in many callers either leaking or haphazardly trying to free these arrays and their contents. Digging further, I found that despite the pretty clear documentation near the top of dir.h that folks were supposed to call clear_directory() when the user no longer needed the dir_struct, there were four callers that didn't bother doing that at all. However, two of them clearly thought about leaks since they had an UNLEAK(dir) directive, which to me suggests that the method to free the data was too unclear. I suspect the non-obviousness of the API and its holes led folks to avoid it, which then snowballed into further problems with the entries[], ignored[], parent_hashmap, and recursive_hashmap problems. Rename clear_directory() to dir_clear() to be more in line with other data structures in git, and introduce a dir_init() to handle the suggested memsetting of dir_struct to all zeroes. I hope that a name like "dir_clear()" is more clear, and that the presence of dir_init() will provide a hint to those looking at the code that they need to look for either a dir_clear() or a dir_free() and lead them to find dir_clear(). Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-19dir: make clear_directory() free all relevant memoryElijah Newren
The calling convention for the dir API is supposed to end with a call to clear_directory() to free up no longer needed memory. However, clear_directory() didn't free dir->entries or dir->ignored. I believe this was an oversight, but a number of callers noticed memory leaks and started free'ing these. Unfortunately, they did so somewhat haphazardly (sometimes freeing the entries in the arrays, and sometimes only free'ing the arrays themselves). This suggests the callers weren't trying to make sure any possible memory used might be free'd, but just the memory they noticed their usecase definitely had allocated. Fix this mess by moving all the duplicated free'ing logic into clear_directory(). End by resetting dir to a pristine state so it could be reused if desired. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-14clear_pattern_list(): clear embedded hashmapsJeff King
Commit 96cc8ab531 (sparse-checkout: use hashmaps for cone patterns, 2019-11-21) added some auxiliary hashmaps to the pattern_list struct, but they're leaked when clear_pattern_list() is called. Signed-off-by: Jeff King <peff@peff.net> Acked-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-12dir: avoid prematurely marking nonbare repositories as matchesElijah Newren
Nonbare repositories are special directories. Unlike normal directories that we might recurse into to list the files they contain, nonbare repositories must themselves match and then we always report only on the nonbare repository directory itself and not on any of its contents. Separately, when traversing directories to try to find untracked or excluded files, we often think in terms of paths either matching the specified pathspec, or not matching them. However, there is a special value that do_match_pathspec() uses named MATCHED_RECURSIVELY_LEADING_PATHSPEC which means "this directory does not match any pathspec BUT it is possible a file or directory underneath it does." That special value prevents us from prematurely thinking that some directory and everything under it is irrelevant, but also allows us to differentiate from "this is a match". The combination of these two special cases was previously uncovered. Add a test to the testsuite to cover it, and make sure that we return a nonbare repository as a non-match if the best match it got was MATCHED_RECURSIVELY_LEADING_PATHSPEC. Reported-by: christian w <usebees@gmail.com> Simplified-testcase-and-bisection-by: Kyle Meyer <kyle@kyleam.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-30Merge branch 'en/fill-directory-exponential' into masterJunio C Hamano
Fix to a regression introduced during 2.27 cycle. * en/fill-directory-exponential: dir: check pathspecs before returning `path_excluded`
2020-07-20dir: check pathspecs before returning `path_excluded`Martin Ågren
In 95c11ecc73 ("Fix error-prone fill_directory() API; make it only return matches", 2020-04-01), we taught `fill_directory()`, or more specifically `treat_path()`, to check against any pathspecs so that we could simplify the callers. But in doing so, we added a slightly-too-early return for the "excluded" case. We end up not checking the pathspecs, meaning we return `path_excluded` when maybe we should return `path_none`. As a result, `git status --ignored -- pathspec` might show paths that don't actually match "pathspec". Move the "excluded" check down to after we've checked any pathspecs. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Martin Ågren <martin.agren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-25Merge branch 'en/clean-cleanups'Junio C Hamano
Code clean-up of "git clean" resulted in a fix of recent performance regression. * en/clean-cleanups: clean: optimize and document cases where we recurse into subdirectories clean: consolidate handling of ignored parameters dir, clean: avoid disallowed behavior dir: fix a few confusing comments
2020-06-18Merge branch 'en/do-match-pathspec-fix'Junio C Hamano
Use of negative pathspec, while collecting paths including untracked ones in the working tree, was broken. * en/do-match-pathspec-fix: dir: fix treatment of negated pathspecs
2020-06-13dir, clean: avoid disallowed behaviorElijah Newren
dir.h documented quite clearly that DIR_SHOW_IGNORED and DIR_SHOW_IGNORED_TOO are mutually exclusive, with a big comment to this effect by the definition of both enum values. However, a command like git clean -fx $DIR would set both values for dir.flags. I _think_ it happened to work because: * As dir.h points out, DIR_KEEP_UNTRACKED_CONTENTS only takes effect if DIR_SHOW_IGNORED_TOO is set. * As coded, I believe DIR_SHOW_IGNORED would just happen to take precedence over DIR_SHOW_IGNORED_TOO in the code as currently constructed. Which is a long way of saying "we just got lucky". Fix clean.c to avoid setting these mutually exclusive values at the same time, and add a check to dir.c that will throw a BUG() to prevent anyone else from making this mistake. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-13dir: fix a few confusing commentsElijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-05dir: fix treatment of negated pathspecsElijah Newren
do_match_pathspec() started life as match_pathspec_depth_1() and for correctness was only supposed to be called from match_pathspec_depth(). match_pathspec_depth() was later renamed to match_pathspec(), so the invariant we expect today is that do_match_pathspec() has no direct callers outside of match_pathspec(). Unfortunately, this intention was lost with the renames of the two functions, and additional calls to do_match_pathspec() were added in commits 75a6315f74 ("ls-files: add pathspec matching for submodules", 2016-10-07) and 89a1f4aaf7 ("dir: if our pathspec might match files under a dir, recurse into it", 2019-09-17). Of course, do_match_pathspec() had an important advantge over match_pathspec() -- match_pathspec() would hardcode flags to one of two values, and these new callers needed to pass some other value for flags. Also, although calling do_match_pathspec() directly was incorrect, there likely wasn't any difference in the observable end output, because the bug just meant that fill_diretory() would recurse into unneeded directories. Since subsequent does-this-path-match checks on individual paths under the directory would cause those extra paths to be filtered out, the only difference from using the wrong function was unnecessary computation. The second of those bad calls to do_match_pathspec() was involved -- via either direct movement or via copying+editing -- into a number of later refactors. See commits 777b420347 ("dir: synchronize treat_leading_path() and read_directory_recursive()", 2019-12-19), 8d92fb2927 ("dir: replace exponential algorithm with a linear one", 2020-04-01), and 95c11ecc73 ("Fix error-prone fill_directory() API; make it only return matches", 2020-04-01). The last of those introduced the usage of do_match_pathspec() on an individual file, and thus resulted in individual paths being returned that shouldn't be. The problem with calling do_match_pathspec() instead of match_pathspec() is that any negated patterns such as ':!unwanted_path` will be ignored. Add a new match_pathspec_with_flags() function to fulfill the needs of specifying special flags while still correctly checking negated patterns, add a big comment above do_match_pathspec() to prevent others from misusing it, and correct current callers of do_match_pathspec() to instead use either match_pathspec() or match_pathspec_with_flags(). One final note is that DO_MATCH_LEADING_PATHSPEC needs special consideration when working with DO_MATCH_EXCLUDE. The point of DO_MATCH_LEADING_PATHSPEC is that if we have a pathspec like */Makefile and we are checking a directory path like src/module/component that we want to consider it a match so that we recurse into the directory because it _might_ have a file named Makefile somewhere below. However, when we are using an exclusion pattern, i.e. we have a pathspec like :(exclude)*/Makefile we do NOT want to say that a directory path like src/module/component is a (negative) match. While there *might* be a file named 'Makefile' somewhere below that directory, there could also be other files and we cannot pre-emptively rule all the files under that directory out; we need to recurse and then check individual files. Adjust the DO_MATCH_LEADING_PATHSPEC logic to only get activated for positive pathspecs. Reported-by: John Millikin <jmillikin@stripe.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-29Merge branch 'en/fill-directory-exponential'Junio C Hamano
The directory traversal code had redundant recursive calls which made its performance characteristics exponential with respect to the depth of the tree, which was corrected. * en/fill-directory-exponential: completion: fix 'git add' on paths under an untracked directory Fix error-prone fill_directory() API; make it only return matches dir: replace double pathspec matching with single in treat_directory() dir: include DIR_KEEP_UNTRACKED_CONTENTS handling in treat_directory() dir: replace exponential algorithm with a linear one dir: refactor treat_directory to clarify control flow dir: fix confusion based on variable tense dir: fix broken comment dir: consolidate treat_path() and treat_one_path() dir: fix simple typo in comment t3000: add more testcases testing a variety of ls-files issues t7063: more thorough status checking
2020-04-01Fix error-prone fill_directory() API; make it only return matchesElijah Newren
Traditionally, the expected calling convention for the dir.c API was: fill_directory(&dir, ..., pathspec) foreach entry in dir->entries: if (dir_path_match(entry, pathspec)) process_or_display(entry) This may have made sense once upon a time, because the fill_directory() call could use cheap checks to avoid doing full pathspec matching, and an external caller may have wanted to do other post-processing of the results anyway. However: * this structure makes it easy for users of the API to get it wrong * this structure actually makes it harder to understand fill_directory() and the functions it uses internally. It has tripped me up several times while trying to fix bugs and restructure things. * relying on post-filtering was already found to produce wrong results; pathspec matching had to be added internally for multiple cases in order to get the right results (see commits 404ebceda01c (dir: also check directories for matching pathspecs, 2019-09-17) and 89a1f4aaf765 (dir: if our pathspec might match files under a dir, recurse into it, 2019-09-17)) * it's bad for performance: fill_directory() already has to do lots of checks and knows the subset of cases where it still needs to do more checks. Forcing external callers to do full pathspec matching means they must re-check _every_ path. So, add the pathspec matching within the fill_directory() internals, and remove it from external callers. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-01dir: replace double pathspec matching with single in treat_directory()Elijah Newren
treat_directory() had a call to both do_match_pathspec() and match_pathspec(). These calls have migrated through the code somewhat since their introduction, but we don't actually need both. Replace the two calls with one, and while at it, move the check earlier in order to reduce the need for callers of fill_directory() to do post-filtering of results. The next patch will address post-filtering more forcefully and provide more relevant history and context. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-01dir: include DIR_KEEP_UNTRACKED_CONTENTS handling in treat_directory()Elijah Newren
Handling DIR_KEEP_UNTRACKED_CONTENTS within treat_directory() instead of as a post-processing step in read_directory(): * allows us to directly access and remove the relevant entries instead of needing to calculate which ones need to be removed * keeps the logic for directory handling in one location (and puts it closer the the logic for stripping out extra ignored entries, which seems logical). Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-01dir: replace exponential algorithm with a linear oneElijah Newren
dir's read_directory_recursive() naturally operates recursively in order to walk the directory tree. Treating of directories is sometimes weird because there are so many different permutations about how to handle directories. Some examples: * 'git ls-files -o --directory' only needs to know that a directory itself is untracked; it doesn't need to recurse into it to see what is underneath. * 'git status' needs to recurse into an untracked directory, but only to determine whether or not it is empty. If there are no files underneath, the directory itself will be omitted from the output. If it is not empty, only the directory will be listed. * 'git status --ignored' needs to recurse into untracked directories and report all the ignored entries and then report the directory as untracked -- UNLESS all the entries under the directory are ignored, in which case we don't print any of the entries under the directory and just report the directory itself as ignored. (Note that although this forces us to walk all untracked files underneath the directory as well, we strip them from the output, except for users like 'git clean' who also set DIR_KEEP_TRACKED_CONTENTS.) * For 'git clean', we may need to recurse into a directory that doesn't match any specified pathspecs, if it's possible that there is an entry underneath the directory that can match one of the pathspecs. In such a case, we need to be careful to omit the directory itself from the list of paths (see commit 404ebceda01c ("dir: also check directories for matching pathspecs", 2019-09-17)) Part of the tension noted above is that the treatment of a directory can change based on the files within it, and based on the various settings in dir->flags. Trying to keep this in mind while reading over the code, it is easy to think in terms of "treat_directory() tells us what to do with a directory, and read_directory_recursive() is the thing that recurses". Since we need to look into a directory to know how to treat it, though, it is quite easy to decide to (also) recurse into the directory from treat_directory() by adding a read_directory_recursive() call. Adding such a call is actually fine, IF we make sure that read_directory_recursive() does not also recurse into that same directory. Unfortunately, commit df5bcdf83aeb ("dir: recurse into untracked dirs for ignored files", 2017-05-18), added exactly such a case to the code, meaning we'd have two calls to read_directory_recursive() for an untracked directory. So, if we had a file named one/two/three/four/five/somefile.txt and nothing in one/ was tracked, then 'git status --ignored' would call read_directory_recursive() twice on the directory 'one/', and each of those would call read_directory_recursive() twice on the directory 'one/two/', and so on until read_directory_recursive() was called 2^5 times for 'one/two/three/four/five/'. Avoid calling read_directory_recursive() twice per level by moving a lot of the special logic into treat_directory(). Since dir.c is somewhat complex, extra cruft built up around this over time. While trying to unravel it, I noticed several instances where the first call to read_directory_recursive() would return e.g. path_untracked for some directory and a later one would return e.g. path_none, despite the fact that the directory clearly should have been considered untracked. The code happened to work due to the side-effect from the first invocation of adding untracked entries to dir->entries; this allowed it to get the correct output despite the supposed override in return value by the later call. I am somewhat concerned that there are still bugs and maybe even testcases with the wrong expectation. I have tried to carefully document treat_directory() since it becomes more complex after this change (though much of this complexity came from elsewhere that probably deserved better comments to begin with). However, much of my work felt more like a game of whackamole while attempting to make the code match the existing regression tests than an attempt to create an implementation that matched some clear design. That seems wrong to me, but the rules of existing behavior had so many special cases that I had a hard time coming up with some overarching rules about what correct behavior is for all cases, forcing me to hope that the regression tests are correct and sufficient. Such a hope seems likely to be ill-founded, given my experience with dir.c-related testcases in the last few months: Examples where the documentation was hard to parse or even just wrong: * 3aca58045f4f (git-clean.txt: do not claim we will delete files with -n/--dry-run, 2019-09-17) * 09487f2cbad3 (clean: avoid removing untracked files in a nested git repository, 2019-09-17) * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17) Examples where testcases were declared wrong and changed: * 09487f2cbad3 (clean: avoid removing untracked files in a nested git repository, 2019-09-17) * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17) * a2b13367fe55 (Revert "dir.c: make 'git-status --ignored' work within leading directories", 2019-12-10) Examples where testcases were clearly inadequate: * 502c386ff944 (t7300-clean: demonstrate deleting nested repo with an ignored file breakage, 2019-08-25) * 7541cc530239 (t7300: add testcases showing failure to clean specified pathspecs, 2019-09-17) * a5e916c7453b (dir: fix off-by-one error in match_pathspec_item, 2019-09-17) * 404ebceda01c (dir: also check directories for matching pathspecs, 2019-09-17) * 09487f2cbad3 (clean: avoid removing untracked files in a nested git repository, 2019-09-17) * e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17) * 452efd11fbf6 (t3011: demonstrate directory traversal failures, 2019-12-10) * b9670c1f5e6b (dir: fix checks on common prefix directory, 2019-12-19) Examples where "correct behavior" was unclear to everyone: https://lore.kernel.org/git/20190905154735.29784-1-newren@gmail.com/ Other commits of note: * 902b90cf42bc (clean: fix theoretical path corruption, 2019-09-17) However, on the positive side, it does make the code much faster. For the following simple shell loop in an empty repository: for depth in $(seq 10 25) do dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done) rm -rf dir mkdir -p $dirs >$dirs/untracked-file /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null done I saw the following timings, in seconds (note that the numbers are a little noisy from run-to-run, but the trend is very clear with every run): 10: 0.03 11: 0.05 12: 0.08 13: 0.19 14: 0.29 15: 0.50 16: 1.05 17: 2.11 18: 4.11 19: 8.60 20: 17.55 21: 33.87 22: 68.71 23: 140.05 24: 274.45 25: 551.15 For the above run, using strace I can look for the number of untracked directories opened and can verify that it matches the expected 2^($depth+1)-2 (the sum of 2^1 + 2^2 + 2^3 + ... + 2^$depth). After this fix, with strace I can verify that the number of untracked directories that are opened drops to just $depth, and the timings all drop to 0.00. In fact, it isn't until a depth of 190 nested directories that it sometimes starts reporting a time of 0.01 seconds and doesn't consistently report 0.01 seconds until there are 240 nested directories. The previous code would have taken 17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS to have completed the 240 nested directories case. It's not often that you get to speed something up by a factor of 3*10^69. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-01dir: refactor treat_directory to clarify control flowDerrick Stolee
The logic in treat_directory() is handled by a multi-case switch statement, but this switch is very asymmetrical, as the first two cases are simple but the third is more complicated than the rest of the method. In fact, the third case includes a "break" statement that leads to the block of code outside the switch statement. That is the only way to reach that block, as the switch handles all possible values from directory_exists_in_index(); Extract the switch statement into a series of "if" statements. This simplifies the trivial cases, while clarifying how to reach the "show_other_directories" case. This is particularly important as the "show_other_directories" case will expand in a later change. Helped-by: Elijah Newren <newren@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-01dir: fix confusion based on variable tenseElijah Newren
Despite having contributed several fixes in this area, I have for months (years?) assumed that the "exclude" variable was a directive; this caused me to think of it as a different mode we operate in and left me confused as I tried to build up a mental model around why we'd need such a directive. I mostly tried to ignore it while focusing on the pieces I was trying to understand. Then I finally traced this variable all back to a call to is_excluded(), meaning it was actually functioning as an adjective. In particular, it was a checked property ("Does this path match a rule in .gitignore?"), rather than a mode passed in from the caller. Change the variable name to match the part of speech used by the function called to define it, which will hopefully make these bits of code slightly clearer to the next reader. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-01dir: fix broken commentElijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-01dir: consolidate treat_path() and treat_one_path()Elijah Newren
Commit 16e2cfa90993 ("read_directory(): further split treat_path()", 2010-01-08) split treat_one_path() out of treat_path(), because treat_leading_path() would not have access to a dirent but wanted to re-use as much of treat_path() as possible. Not re-using all of treat_path() caused other bugs, as noted in commit b9670c1f5e6b ("dir: fix checks on common prefix directory", 2019-12-19). Finally, in commit ad6f2157f951 ("dir: restructure in a way to avoid passing around a struct dirent", 2020-01-16), dirents were removed from treat_path() and other functions entirely. Since the only reason for splitting these functions was the lack of a dirent -- which no longer applies to either function -- and since the split caused problems in the past resulting in us not using treat_one_path() separately anymore, just undo the split. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-01dir: fix simple typo in commentElijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-05Merge branch 'ds/sparse-add'Junio C Hamano
"git sparse-checkout" learned a new "add" subcommand. * ds/sparse-add: sparse-checkout: allow one-character directories in cone mode sparse-checkout: work with Windows paths sparse-checkout: create 'add' subcommand sparse-checkout: extract pattern update from 'set' subcommand sparse-checkout: extract add_patterns_from_input()
2020-02-20sparse-checkout: allow one-character directories in cone modeDerrick Stolee
In 9e6d3e64 (sparse-checkout: detect short patterns, 2020-01-24), a condition on the minimum length of a cone-mode pattern was introduced. However, this condition was off-by-one. If we have a directory with a single character, say "b", then the command git sparse-checkout set b will correctly add the pattern "/b/" to the sparse-checkout file. When this is interpeted in dir.c, the pattern is "/b" with the PATTERN_FLAG_MUSTBEDIR flag. This string has length two, which satisfies our inclusive inequality (<= 2). The reason for this inequality is that we will start to read the pattern string character-by-character using three char pointers: prev, cur, next. In particular, next is set to the current pattern plus two. The mistake was that next will still be a valid pointer when the pattern length is two, since the string is null-terminated. Make this inequality strict so these patterns work. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-14Merge branch 'mt/use-passed-repo-more-in-funcs'Junio C Hamano
Some codepaths were given a repository instance as a parameter to work in the repository, but passed the_repository instance to its callees, which has been cleaned up (somewhat). * mt/use-passed-repo-more-in-funcs: sha1-file: allow check_object_signature() to handle any repo sha1-file: pass git_hash_algo to hash_object_file() sha1-file: pass git_hash_algo to write_object_file_prepare() streaming: allow open_istream() to handle any repo pack-check: use given repo's hash_algo at verify_packfile() cache-tree: use given repo's hash_algo at verify_one() diff: make diff_populate_filespec() honor its repo argument
2020-02-14Merge branch 'ds/sparse-checkout-harden'Junio C Hamano
Some rough edges in the sparse-checkout feature, especially around the cone mode, have been cleaned up. * ds/sparse-checkout-harden: sparse-checkout: fix cone mode behavior mismatch sparse-checkout: improve docs around 'set' in cone mode sparse-checkout: escape all glob characters on write sparse-checkout: use C-style quotes in 'list' subcommand sparse-checkout: unquote C-style strings over --stdin sparse-checkout: write escaped patterns in cone mode sparse-checkout: properly match escaped characters sparse-checkout: warn on globs in cone patterns sparse-checkout: detect short patterns sparse-checkout: cone mode does not recognize "**" sparse-checkout: fix documentation typo for core.sparseCheckoutCone clone: fix --sparse option with URLs sparse-checkout: create leading directories t1091: improve here-docs t1091: use check_files to reduce boilerplate
2020-01-31sparse-checkout: properly match escaped charactersDerrick Stolee
In cone mode, the sparse-checkout feature uses hashset containment queries to match paths. Make this algorithm respect escaped asterisk (*) and backslash (\) characters. Create dup_and_filter_pattern() method to convert a pattern by removing escape characters and dropping an optional "/*" at the end. This method is available in dir.h as we will use it in builtin/sparse-checkout.c in a later change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-31sparse-checkout: warn on globs in cone patternsDerrick Stolee
In cone mode, the sparse-checkout commmand will write patterns that allow faster pattern matching. This matching only works if the patterns in the sparse-checkout file are those written by that command. Users can edit the sparse-checkout file and create patterns that cause the cone mode matching to fail. The cone mode patterns may end in "/*" but otherwise an un-escaped asterisk or other glob character is invalid. Add checks to disable cone mode when seeing these values. A later change will properly handle escaped globs. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-31sha1-file: pass git_hash_algo to hash_object_file()Matheus Tavares
Allow hash_object_file() to work on arbitrary repos by introducing a git_hash_algo parameter. Change callers which have a struct repository pointer in their scope to pass on the git_hash_algo from the said repo. For all other callers, pass on the_hash_algo, which was already being used internally at hash_object_file(). This functionality will be used in the following patch to make check_object_signature() be able to work on arbitrary repos (which, in turn, will be used to fix an inconsistency at object.c:parse_object()). Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-24sparse-checkout: detect short patternsDerrick Stolee
In cone mode, the shortest pattern the sparse-checkout command will write into the sparse-checkout file is "/*". This is handled carefully in add_pattern_to_hashsets(), so warn if any other pattern is this short. This will assist future pattern checks by allowing us to assume there are at least three characters in the pattern. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-24sparse-checkout: cone mode does not recognize "**"Derrick Stolee
When core.sparseCheckoutCone is enabled, the 'git sparse-checkout set' command creates a restricted set of possible patterns that are used by a custom algorithm to quickly match those patterns. If a user manually edits the sparse-checkout file, then they could create patterns that do not match these expectations. The cone-mode matching algorithm can return incorrect results. The solution is to detect these incorrect patterns, warn that we do not recognize them, and revert to the standard algorithm. Check each pattern for the "**" substring, and revert to the old logic if seen. While technically a "/<dir>/**" pattern matches the meaning of "/<dir>/", it is not one that would be written by the sparse-checkout builtin in cone mode. Attempting to accept that pattern change complicates the logic and instead we punt and do not accept any instance of "**". Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16dir: point treat_leading_path() warning to the right placeJeff King
Commit 777b420347 (dir: synchronize treat_leading_path() and read_directory_recursive(), 2019-12-19) tried to add two warning comments in those functions, pointing at each other. But the one in treat_leading_path() just points at itself. Let's fix that. Since the comment also redirects the reader for more details to "the commit that added this warning", and since we're now modifying the warning (creating a new commit without those details), let's mention the actual commit id. Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16dir: restructure in a way to avoid passing around a struct direntJeff King
Restructure the code slightly to avoid passing around a struct dirent anywhere, which also enables us to avoid trying to manufacture one. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16dir: treat_leading_path() and read_directory_recursive(), round 2Elijah Newren
I was going to title this "dir: more synchronizing of treat_leading_path() and read_directory_recursive()", a nod to commit 777b42034764 ("dir: synchronize treat_leading_path() and read_directory_recursive()", 2019-12-19), but the title was too long. Anyway, first the backstory... fill_directory() has always had a slightly error-prone interface: it returns a subset of paths which *might* match the specified pathspec; it was intended to prune away some paths which didn't match the specified pathspec and keep at least all the ones that did match it. Given this interface, callers were responsible to post-process the results and check whether each actually matched the pathspec. builtin/clean.c did this. It would first prune out duplicates (e.g. if "dir" was returned as well as all files under "dir/", then it would simplify this to just "dir"), and after pruning duplicates it would compare the remaining paths to the specified pathspec(s). This post-processing itself could run into problems, though, as noted in commit 404ebceda01c ("dir: also check directories for matching pathspecs", 2019-09-17): For the case of git-clean and a set of pathspecs of "dir/file" and "more", this caused a problem because we'd end up with dir entries for both of "dir" "dir/file" Then correct_untracked_entries() would try to helpfully prune duplicates for us by removing "dir/file" since it's under "dir", leaving us with "dir" Since the original pathspec only had "dir/file", the only entry left doesn't match and leaves nothing to be removed. (Note that if only one pathspec was specified, e.g. only "dir/file", then the common_prefix_len optimizations in fill_directory would cause us to bypass this problem, making it appear in simple tests that we could correctly remove manually specified pathspecs.) That commit fixed the issue -- when multiple pathspecs were specified -- by making sure fill_directory() wouldn't return both "dir" and "dir/file" outside the common_prefix_len optimization path. This is where it starts to get fun. In commit b9670c1f5e6b ("dir: fix checks on common prefix directory", 2019-12-19), we noticed that the common_prefix_len wasn't doing appropriate checks and letting all kinds of stuff through, resulting in recursing into .git/ directories and other craziness. So it started locking down and doing checks on pathnames within that code path. That continued with commit 777b42034764 ("dir: synchronize treat_leading_path() and read_directory_recursive()", 2019-12-19), which noted the following: Our optimization to avoid calling into read_directory_recursive() when all pathspecs have a common leading directory mean that we need to match the logic that read_directory_recursive() would use if we had just called it from the root. Since it does more than call treat_path() we need to copy that same logic. ...and then it more forcefully addressed the issue with this wonderfully ironic statement: Needing to duplicate logic like this means it is guaranteed someone will eventually need to make further changes and forget to update both locations. It is tempting to just nuke the leading_directory special casing to avoid such bugs and simplify the code, but unpack_trees' verify_clean_subdirectory() also calls read_directory() and does so with a non-empty leading path, so I'm hesitant to try to restructure further. Add obnoxious warnings to treat_leading_path() and read_directory_recursive() to try to warn people of such problems. You would think that with such a strongly worded description, that its author would have actually ensured that the logic in treat_leading_path() and read_directory_recursive() did actually match and that *everything* that was needed had at least been copied over at the time that this paragraph was written. But you'd be wrong, I messed it up by missing part of the logic. Copy the missing bits to fix the new final test in t7300. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-25Merge branch 'en/fill-directory-fixes'Junio C Hamano
Assorted fixes to the directory traversal API. * en/fill-directory-fixes: dir.c: use st_add3() for allocation size dir: consolidate similar code in treat_directory() dir: synchronize treat_leading_path() and read_directory_recursive() dir: fix checks on common prefix directory dir: break part of read_directory_recursive() out for reuse dir: exit before wildcard fall-through if there is no wildcard dir: remove stray quote character in comment Revert "dir.c: make 'git-status --ignored' work within leading directories" t3011: demonstrate directory traversal failures
2019-12-25Merge branch 'ds/sparse-cone'Junio C Hamano
Management of sparsely checked-out working tree has gained a dedicated "sparse-checkout" command. * ds/sparse-cone: (21 commits) sparse-checkout: improve OS ls compatibility sparse-checkout: respect core.ignoreCase in cone mode sparse-checkout: check for dirty status sparse-checkout: update working directory in-process for 'init' sparse-checkout: cone mode should not interact with .gitignore sparse-checkout: write using lockfile sparse-checkout: use in-process update for disable subcommand sparse-checkout: update working directory in-process sparse-checkout: sanitize for nested folders unpack-trees: add progress to clear_ce_flags() unpack-trees: hash less in cone mode sparse-checkout: init and set in cone mode sparse-checkout: use hashmaps for cone patterns sparse-checkout: add 'cone' mode trace2: add region in clear_ce_flags sparse-checkout: create 'disable' subcommand sparse-checkout: add '--stdin' option to set subcommand sparse-checkout: 'set' subcommand clone: add --sparse mode sparse-checkout: create 'init' subcommand ...
2019-12-20dir.c: use st_add3() for allocation sizeJunio C Hamano
When preparing a manufactured dirent instance, we add a length of path to the size of struct to decide how many bytes to allocate. Make sure this addition does not wrap-around to cause us underallocate. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-19dir: consolidate similar code in treat_directory()Elijah Newren
Both the DIR_SKIP_NESTED_GIT and DIR_NO_GITLINKS cases were checking for whether a path was actually a nonbare repository. That code could be shared, with just the result of how to act differing between the two cases. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-19dir: synchronize treat_leading_path() and read_directory_recursive()Elijah Newren
Our optimization to avoid calling into read_directory_recursive() when all pathspecs have a common leading directory mean that we need to match the logic that read_directory_recursive() would use if we had just called it from the root. Since it does more than call treat_path() we need to copy that same logic. Alternatively, we could try to change treat_path to return path_recurse for an untracked directory under the given special circumstances that this logic checks for, but a simple switch results in many test failures such as 'git clean -d' not wiping out untracked but empty directories. To work around that, we'd need the caller of treat_path to check for path_recurse and sometimes special case it into path_untracked. In other words, we'd still have extra logic in both places. Needing to duplicate logic like this means it is guaranteed someone will eventually need to make further changes and forget to update both locations. It is tempting to just nuke the leading_directory special casing to avoid such bugs and simplify the code, but unpack_trees' verify_clean_subdirectory() also calls read_directory() and does so with a non-empty leading path, so I'm hesitant to try to restructure further. Add obnoxious warnings to treat_leading_path() and read_directory_recursive() to try to warn people of such problems. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-19dir: fix checks on common prefix directoryElijah Newren
Many years ago, the directory traversing logic had an optimization that would always recurse into any directory that was a common prefix of all the pathspecs without walking the leading directories to get down to the desired directory. Thus, git ls-files -o .git/ # case A would notice that .git/ was a common prefix of all pathspecs (since it is the only pathspec listed), and then traverse into it and start showing unknown files under that directory. Unfortunately, .git/ is not a directory we should be traversing into, which made this optimization problematic. This also affected cases like git ls-files -o --exclude-standard t/ # case B where t/ was in the .gitignore file and thus isn't interesting and shouldn't be recursed into. It also affected cases like git ls-files -o --directory untracked_dir/ # case C where untracked_dir/ is indeed untracked and thus interesting, but the --directory flag means we only want to show the directory itself, not recurse into it and start listing untracked files below it. The case B class of bugs were noted and fixed in commits 16e2cfa90993 ("read_directory(): further split treat_path()", 2010-01-08) and 48ffef966c76 ("ls-files: fix overeager pathspec optimization", 2010-01-08), with the idea being that we first wanted to check whether the common prefix was interesting. The former patch noted that treat_path() couldn't be used when checking the common prefix because treat_path() requires a dir_entry() and we haven't read any directories at the point we are checking the common prefix. So, that patch split treat_one_path() out of treat_path(). The latter patch then created a new treat_leading_path() which duplicated by hand the bits of treat_path() that couldn't be broken out and then called treat_one_path() for the remainder. There were three problems with this approach: * The duplicated logic in treat_leading_path() accidentally missed the check for special paths (such as is_dot_or_dotdot and matching ".git"), causing case A types of bugs to continue to be an issue. * The treat_leading_path() logic assumed we should traverse into anything where path_treatment was not path_none, i.e. it perpetuated class C types of bugs. * It meant we had split logic that needed to kept in sync, running the risk that people introduced new inconsistencies (such as in commit be8a84c52669, which we reverted earlier in this series, or in commit df5bcdf83ae which we'll fix in a subsequent commit) Fix most these problems by making treat_leading_path() not only loop over each leading path component, but calling treat_path() directly on each. To do so, we have to create a synthetic dir_entry, but that only takes a few lines. Then, pay attention to the path_treatment result we get from treat_path() and don't treat path_excluded, path_untracked, and path_recurse all the same as path_recurse. This leaves one remaining problem, the new inconsistency from commit df5bcdf83ae. That will be addressed in a subsequent commit. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-16Merge branch 'hw/doc-in-header'Junio C Hamano
* hw/doc-in-header: trace2: move doc to trace2.h submodule-config: move doc to submodule-config.h tree-walk: move doc to tree-walk.h trace: move doc to trace.h run-command: move doc to run-command.h parse-options: add link to doc file in parse-options.h credential: move doc to credential.h argv-array: move doc to argv-array.h cache: move doc to cache.h sigchain: move doc to sigchain.h pathspec: move doc to pathspec.h revision: move doc to revision.h attr: move doc to attr.h refs: move doc to refs.h remote: move doc to remote.h and refspec.h sha1-array: move doc to sha1-array.h merge: move doc to ll-merge.h graph: move doc to graph.h and graph.c dir: move doc to dir.h diff: move doc to diff.h and diffcore.h
2019-12-13sparse-checkout: respect core.ignoreCase in cone modeDerrick Stolee
When a user uses the sparse-checkout feature in cone mode, they add patterns using "git sparse-checkout set <dir1> <dir2> ..." or by using "--stdin" to provide the directories line-by-line over stdin. This behaviour naturally looks a lot like the way a user would type "git add <dir1> <dir2> ..." If core.ignoreCase is enabled, then "git add" will match the input using a case-insensitive match. Do the same for the sparse-checkout feature. Perform case-insensitive checks while updating the skip-worktree bits during unpack_trees(). This is done by changing the hash algorithm and hashmap comparison methods to optionally use case- insensitive methods. When this is enabled, there is a small performance cost in the hashing algorithm. To tease out the worst possible case, the following was run on a repo with a deep directory structure: git ls-tree -d -r --name-only HEAD | git sparse-checkout set --stdin The 'set' command was timed with core.ignoreCase disabled or enabled. For the repo with a deep history, the numbers were core.ignoreCase=false: 62s core.ignoreCase=true: 74s (+19.3%) For reproducibility, the equivalent test on the Linux kernel repository had these numbers: core.ignoreCase=false: 3.1s core.ignoreCase=true: 3.6s (+16%) Now, this is not an entirely fair comparison, as most users will define their sparse cone using more shallow directories, and the performance improvement from eb42feca97 ("unpack-trees: hash less in cone mode" 2019-11-21) can remove most of the hash cost. For a more realistic test, drop the "-r" from the ls-tree command to store only the first-level directories. In that case, the Linux kernel repository takes 0.2-0.25s in each case, and the deep repository takes one second, plus or minus 0.05s, in each case. Thus, we _can_ demonstrate a cost to this change, but it is unlikely to matter to any reasonable sparse-checkout cone. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-11dir: break part of read_directory_recursive() out for reuseElijah Newren
Create an add_path_to_appropriate_result_list() function from the code at the end of read_directory_recursive() so we can use it elsewhere. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>