path: root/sha1-lookup.c
AgeCommit message (Collapse)Author
2019-10-15Merge branch 'js/azure-pipelines-msvc'Junio C Hamano
CI updates. * js/azure-pipelines-msvc: ci: also build and test with MS Visual Studio on Azure Pipelines ci: really use shallow clones on Azure Pipelines tests: let --immediate and --write-junit-xml play well together test-tool run-command: learn to run (parts of) the testsuite vcxproj: include more generated files vcxproj: only copy `git-remote-http.exe` once it was built msvc: work around a bug in GetEnvironmentVariable() msvc: handle DEVELOPER=1 msvc: ignore some libraries when linking compat/win32/path-utils.h: add #include guards winansi: use FLEX_ARRAY to avoid compiler warning msvc: avoid using minus operator on unsigned types push: do not pretend to return `int` from `die_push_simple()`
2019-10-06msvc: avoid using minus operator on unsigned typesJohannes Schindelin
MSVC complains about this with `-Wall`, which can be taken as a sign that this is indeed a real bug. The symptom is: C4146: unary minus operator applied to unsigned type, result still unsigned Let's avoid this warning in the minimal way, e.g. writing `-1 - <unsigned value>` instead of `-<unsigned value> - 1`. Note that the change in the `estimate_cache_size()` function is needed because MSVC considers the "return type" of the `sizeof()` operator to be `size_t`, i.e. unsigned, and therefore it cannot be negated using the unary minus operator. Even worse, that arithmetic is doing extra work, in vain. We want to calculate the entry extra cache size as the difference between the size of the `cache_entry` structure minus the size of the `ondisk_cache_entry` structure, padded to the appropriate alignment boundary. To that end, we start by assigning that difference to the `per_entry` variable, and then abuse the `len` parameter of the `align_padding_size()` macro to take the negative size of the ondisk entry size. Essentially, we try to avoid passing the already calculated difference to that macro by passing the operands of that difference instead, when the macro expects operands of an addition: #define align_padding_size(size, len) \ ((size + (len) + 8) & ~7) - (size + len) Currently, we pass A and -B to that macro instead of passing A - B and 0, where A - B is already stored in the `per_entry` variable, ready to be used. This is neither necessary, nor intuitive. Let's fix this, and have code that is both easier to read and that also does not trigger MSVC's warning. While at it, we take care of reporting overflows (which are unlikely, but hey, defensive programming is good!). We _also_ take pains of casting the unsigned value to signed: otherwise, the signed operand (i.e. the `-1`) would be cast to unsigned before doing the arithmetic. Helped-by: Denton Liu <> Signed-off-by: Johannes Schindelin <> Signed-off-by: Junio C Hamano <>
2019-08-19sha1-lookup: switch hard-coded constants to the_hash_algobrian m. carlson
Switch a hard-coded 18 to be a reference to the_hash_algo. Rename the sha1 parameter to be called hash instead. Signed-off-by: brian m. carlson <> Signed-off-by: Junio C Hamano <>
2018-05-06Replace all die("BUG: ...") calls by BUG() onesJohannes Schindelin
In d8193743e08 (usage.c: add BUG() function, 2017-05-12), a new macro was introduced to use for reporting bugs instead of die(). It was then subsequently used to convert one single caller in 588a538ae55 (setup_git_env: convert die("BUG") to BUG(), 2017-05-12). The cover letter of the patch series containing this patch (cf is not terribly clear why only one call site was converted, or what the plan is for other, similar calls to die() to report bugs. Let's just convert all remaining ones in one fell swoop. This trick was performed by this invocation: sed -i 's/die("BUG: /BUG("/g' $(git grep -l 'die("BUG' \*.c) Signed-off-by: Johannes Schindelin <> Signed-off-by: Junio C Hamano <>
2018-02-15packfile: refactor hash search with fanout tableJonathan Tan
Subsequent patches will introduce file formats that make use of a fanout array and a sorted table containing hashes, just like packfiles. Refactor the hash search in packfile.c into its own function, so that those patches can make use of it as well. Signed-off-by: Jonathan Tan <> Signed-off-by: Junio C Hamano <>
2017-10-09cleanup: fix possible overflow errors in binary searchDerrick Stolee
A common mistake when writing binary search is to allow possible integer overflow by using the simple average: mid = (min + max) / 2; Instead, use the overflow-safe version: mid = min + (max - min) / 2; This translation is safe since the operation occurs inside a loop conditioned on "min < max". The included changes were found using the following git grep: git grep '/ *2;' '*.c' Making this cleanup will prevent future review friction when a new binary search is contructed based on existing code. Signed-off-by: Derrick Stolee <> Reviewed-by: Jeff King <> Signed-off-by: Junio C Hamano <>
2017-08-09sha1_file: drop experimental GIT_USE_LOOKUP searchJeff King
Long ago in 628522ec14 (sha1-lookup: more memory efficient search in sorted list of SHA-1, 2007-12-29) we added sha1_entry_pos(), a binary search that uses the uniform distribution of sha1s to scale the selection of mid-points. As this was a performance experiment, we tied it to the GIT_USE_LOOKUP environment variable and never enabled it by default. This code was successful in reducing the number of steps in each search. But the overhead of the scaling ends up making it slower when the cache is warm. Here are best-of-five timings for running rev-list on linux.git, which will have to look up every object: $ time git rev-list --objects --all >/dev/null real 0m35.357s user 0m35.016s sys 0m0.340s $ time GIT_USE_LOOKUP=1 git rev-list --objects --all >/dev/null real 0m37.364s user 0m37.045s sys 0m0.316s The USE_LOOKUP version might have more benefit on a cold cache, as the time to fault in each page would dominate. But that would be for a single lookup. In practice, most operations tend to look up many objects, and the whole pack .idx will end up warm. It's possible that the code could be better optimized to compete with a naive binary search for the warm-cache case, and we could have the best of both worlds. But over the years nobody has done so, and this is largely dead code that is rarely run outside of the test suite. Let's drop it in the name of simplicity. This lets us remove sha1_entry_pos() entirely, as the .idx lookup code was the only caller. Note that sha1-lookup.c still contains sha1_pos(), which differs from sha1_entry_pos() in two ways: - it has a different interface; it uses a function pointer to access sha1 entries rather than a size/offset pair describing the table's memory layout - it only scales the initial selection of "mi", rather than each iteration of the search We can't get rid of this function, as it's called from several places. It may be that we could replace it with a simple binary search, but that's out of scope for this patch (and would need benchmarking). Signed-off-by: Jeff King <> Signed-off-by: Junio C Hamano <>
2014-10-01sha1-lookup: handle duplicates in sha1_pos()René Scharfe
If the first 18 bytes of the SHA1's of all entries are the same then sha1_pos() dies and reports that the lower and upper limits of the binary search were the same that this wasn't supposed to happen. This is wrong because the remaining two bytes could still differ. Furthermore: It wouldn't be a problem if they actually were the same, i.e. if all entries have the same SHA1. The code already handles duplicates just fine. Simply remove the erroneous check. Signed-off-by: Rene Scharfe <> Acked-by: Jeff King <> Signed-off-by: Junio C Hamano <>
2013-08-25sha1-lookup: handle duplicate keys with GIT_USE_LOOKUPJeff King
The sha1_entry_pos function tries to be smart about selecting the middle of a range for its binary search by looking at the value differences between the "lo" and "hi" constraints. However, it is unable to cope with entries with duplicate keys in the sorted list. We may hit a point in the search where both our "lo" and "hi" point to the same key. In this case, the range of values between our endpoints is 0, and trying to scale the difference between our key and the endpoints over that range is undefined (i.e., divide by zero). The current code catches this with an "assert(lov < hiv)". Moreover, after seeing that the first 20 byte of the key are the same, we will try to establish a value from the 21st byte. Which is nonsensical. Instead, we can detect the case that we are in a run of duplicates, and simply do a final comparison against any one of them (since they are all the same, it does not matter which). If the keys match, we have found our entry (or one of them, anyway). If not, then we know that we do not need to look further, as we must be in a run of the duplicate key. Signed-off-by: Jeff King <> Acked-by: Nicolas Pitre <> Signed-off-by: Junio C Hamano <>
2009-04-06sha1-lookup: fix up the assertion messageJunio C Hamano
Signed-off-by: Junio C Hamano <>
2009-04-05sha1-lookup: add new "sha1_pos" function to efficiently lookup sha1Christian Couder
This function has been copied from the "patch_pos" function in "patch-ids.c" but an additional parameter has been added. The new parameter is a function pointer, that is used to access the sha1 of an element in the table. Signed-off-by: Christian Couder <> Signed-off-by: Junio C Hamano <>
2008-04-09sha1-lookup: make selection of 'middle' less aggressiveJunio C Hamano
If we pick 'mi' between 'lo' and 'hi' at 50%, which was what the simple binary search did, we are halving the search space whether the entry at 'mi' is lower or higher than the target. The previous patch was about picking not the middle but closer to 'hi', when we know the target is a lot closer to 'hi' than it is to 'lo'. However, if it turns out that the entry at 'mi' is higher than the target, we would end up reducing the search space only by the difference between 'mi' and 'hi' (which by definition is less than 50% --- that was the whole point of not using the simple binary search), which made the search less efficient. And the risk of overshooting becomes very high, if we try to be too precise. This tweaks the selection of 'mi' to be a bit closer to the middle than we would otherwise pick to avoid the problem. Signed-off-by: Junio C Hamano <>
2008-04-09sha1-lookup: more memory efficient search in sorted list of SHA-1Junio C Hamano
Currently, when looking for a packed object from the pack idx, a simple binary search is used. A conventional binary search loop looks like this: unsigned lo, hi; do { unsigned mi = (lo + hi) / 2; int cmp = "entry pointed at by mi" minus "target"; if (!cmp) return mi; "mi is the wanted one" if (cmp > 0) hi = mi; "mi is larger than target" else lo = mi+1; "mi is smaller than target" } while (lo < hi); "did not find what we wanted" The invariants are: - When entering the loop, 'lo' points at a slot that is never above the target (it could be at the target), 'hi' points at a slot that is guaranteed to be above the target (it can never be at the target). - We find a point 'mi' between 'lo' and 'hi' ('mi' could be the same as 'lo', but never can be as high as 'hi'), and check if 'mi' hits the target. There are three cases: - if it is a hit, we have found what we are looking for; - if it is strictly higher than the target, we set it to 'hi', and repeat the search. - if it is strictly lower than the target, we update 'lo' to one slot after it, because we allow 'lo' to be at the target and 'mi' is known to be below the target. If the loop exits, there is no matching entry. When choosing 'mi', we do not have to take the "middle" but anywhere in between 'lo' and 'hi', as long as lo <= mi < hi is satisfied. When we somehow know that the distance between the target and 'lo' is much shorter than the target and 'hi', we could pick 'mi' that is much closer to 'lo' than (hi+lo)/2, which a conventional binary search would pick. This patch takes advantage of the fact that the SHA-1 is a good hash function, and as long as there are enough entries in the table, we can expect uniform distribution. An entry that begins with for example "deadbeef..." is much likely to appear much later than in the midway of a reasonably populated table. In fact, it can be expected to be near 87% (222/256) from the top of the table. This is a work-in-progress and has switches to allow easier experiments and debugging. Exporting GIT_USE_LOOKUP environment variable enables this code. On my admittedly memory starved machine, with a partial KDE repository (3.0G pack with 95M idx): $ GIT_USE_LOOKUP=t git log -800 --stat HEAD >/dev/null 3.93user 0.16system 0:04.09elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+55588minor)pagefaults 0swaps Without the patch, the numbers are: $ git log -800 --stat HEAD >/dev/null 4.00user 0.15system 0:04.17elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+60258minor)pagefaults 0swaps In the same repository: $ GIT_USE_LOOKUP=t git log -2000 HEAD >/dev/null 0.12user 0.00system 0:00.12elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+4241minor)pagefaults 0swaps Without the patch, the numbers are: $ git log -2000 HEAD >/dev/null 0.05user 0.01system 0:00.07elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+8506minor)pagefaults 0swaps There isn't much time difference, but the number of minor faults seems to show that we are touching much smaller number of pages, which is expected. Signed-off-by: Junio C Hamano <>