summaryrefslogtreecommitdiff
path: root/object-store.h
AgeCommit message (Collapse)Author
2021-07-08oidtree: a crit-bit tree for odb_loose_cacheEric Wong
This saves 8K per `struct object_directory', meaning it saves around 800MB in my case involving 100K alternates (half or more of those alternates are unlikely to hold loose objects). This is implemented in two parts: a generic, allocation-free `cbtree' and the `oidtree' wrapper on top of it. The latter provides allocation using alloc_state as a memory pool to improve locality and reduce free(3) overhead. Unlike oid-array, the crit-bit tree does not require sorting. Performance is bound by the key length, for oidtree that is fixed at sizeof(struct object_id). There's no need to have 256 oidtrees to mitigate the O(n log n) overhead like we did with oid-array. Being a prefix trie, it is natively suited for expanding short object IDs via prefix-limited iteration in `find_short_object_filename'. On my busy workstation, p4205 performance seems to be roughly unchanged (+/-8%). Startup with 100K total alternates with no loose objects seems around 10-20% faster on a hot cache. (800MB in memory savings means more memory for the kernel FS cache). The generic cbtree implementation does impose some extra overhead for oidtree in that it uses memcmp(3) on "struct object_id" so it wastes cycles comparing 12 extra bytes on SHA-1 repositories. I've not yet explored reducing this overhead, but I expect there are many places in our code base where we'd want to investigate this. More information on crit-bit trees: https://cr.yp.to/critbit.html Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-08make object_directory.loose_objects_subdir_seen a bitmapEric Wong
There's no point in using 8 bits per-directory when 1 bit will do. This saves us 224 bytes per object directory, which ends up being 22MB when dealing with 100K alternates. Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-08speed up alt_odb_usable() with many alternatesEric Wong
With many alternates, the duplicate check in alt_odb_usable() wastes many cycles doing repeated fspathcmp() on every existing alternate. Use a khash to speed up lookups by odb->path. Since the kh_put_* API uses the supplied key without duplicating it, we also take advantage of it to replace both xstrdup() and strbuf_release() in link_alt_odb_entry() with strbuf_detach() to avoid the allocation and copy. In a test repository with 50K alternates and each of those 50K alternates having one alternate each (for a total of 100K total alternates); this speeds up lookup of a non-existent blob from over 16 minutes to roughly 2.7 seconds on my busy workstation. Note: all underlying git object directories were small and unpacked with only loose objects and no packs. Having to load packs increases times significantly. Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23packfile: add kept-pack cache for find_kept_pack_entry()Jeff King
In a recent patch we added a function 'find_kept_pack_entry()' to look for an object only among kept packs. While this function avoids doing any lookup work in non-kept packs, it is still linear in the number of packs, since we have to traverse the linked list of packs once per object. Let's cache a reduced version of that list to save us time. Note that this cache will last the lifetime of the program. We could invalidate it on reprepare_packed_git(), but there's not much point in being rigorous here: - we might already fail to notice new .keep packs showing up after the program starts. We only reprepare_packed_git() when we fail to find an object. But adding a new pack won't cause that to happen. Somebody repacking could add a new pack and delete an old one, but most of the time we'd have a descriptor or mmap open to the old pack anyway, so we might not even notice. - in pack-objects we already cache the .keep state at startup, since 56dfeb6263 (pack-objects: compute local/ignore_pack_keep early, 2016-07-29). So this is just extending that concept further. - we don't have to worry about any packed_git being removed; we always keep the old structs around, even after reprepare_packed_git() We do defensively invalidate the cache in case the set of kept packs being asked for changes (e.g., only in-core kept packs were cached, but suddenly the caller also wants on-disk kept packs, too). In theory we could build all three caches and switch between them, but it's not necessary, since this patch (and series) never changes the set of kept packs that it wants to inspect from the cache. So that "optimization" is more about being defensive in the face of future changes than it is about asking for multiple kinds of kept packs in this patch. Here are p5303 results (as always, measured against the kernel): Test HEAD^ HEAD ----------------------------------------------------------------------------------------------- 5303.5: repack (1) 57.34(54.66+10.88) 56.98(54.36+10.98) -0.6% 5303.6: repack with kept (1) 57.38(54.83+10.49) 57.17(54.97+10.26) -0.4% 5303.11: repack (50) 71.70(88.99+4.74) 71.62(88.48+5.08) -0.1% 5303.12: repack with kept (50) 72.58(89.61+4.78) 71.56(88.80+4.59) -1.4% 5303.17: repack (1000) 217.19(491.72+14.25) 217.31(490.82+14.53) +0.1% 5303.18: repack with kept (1000) 246.12(520.07+14.93) 217.08(490.37+15.10) -11.8% and the --stdin-packs case, which scales a little bit better (although not by that much even at 1,000 packs): 5303.7: repack with --stdin-packs (1) 0.00(0.00+0.00) 0.00(0.00+0.00) = 5303.13: repack with --stdin-packs (50) 3.43(11.75+0.24) 3.43(11.69+0.30) +0.0% 5303.19: repack with --stdin-packs (1000) 130.50(307.15+7.66) 125.13(301.36+8.04) -4.1% Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-26packfile: prepare for the existence of '*.rev' filesTaylor Blau
Specify the format of the on-disk reverse index 'pack-*.rev' file, as well as prepare the code for the existence of such files. The reverse index maps from pack relative positions (i.e., an index into the array of object which is sorted by their offsets within the packfile) to their position within the 'pack-*.idx' file. Today, this is done by building up a list of (off_t, uint32_t) tuples for each object (the off_t corresponding to that object's offset, and the uint32_t corresponding to its position in the index). To convert between pack and index position quickly, this array of tuples is radix sorted based on its offset. This has two major drawbacks: First, the in-memory cost scales linearly with the number of objects in a pack. Each 'struct revindex_entry' is sizeof(off_t) + sizeof(uint32_t) + padding bytes for a total of 16. To observe this, force Git to load the reverse index by, for e.g., running 'git cat-file --batch-check="%(objectsize:disk)"'. When asking for a single object in a fresh clone of the kernel, Git needs to allocate 120+ MB of memory in order to hold the reverse index in memory. Second, the cost to sort also scales with the size of the pack. Luckily, this is a linear function since 'load_pack_revindex()' uses a radix sort, but this cost still must be paid once per pack per process. As an example, it takes ~60x longer to print the _size_ of an object as it does to print that entire object's _contents_: Benchmark #1: git.compile cat-file --batch <obj Time (mean ± σ): 3.4 ms ± 0.1 ms [User: 3.3 ms, System: 2.1 ms] Range (min … max): 3.2 ms … 3.7 ms 726 runs Benchmark #2: git.compile cat-file --batch-check="%(objectsize:disk)" <obj Time (mean ± σ): 210.3 ms ± 8.9 ms [User: 188.2 ms, System: 23.2 ms] Range (min … max): 193.7 ms … 224.4 ms 13 runs Instead, avoid computing and sorting the revindex once per process by writing it to a file when the pack itself is generated. The format is relatively straightforward. It contains an array of uint32_t's, the length of which is equal to the number of objects in the pack. The ith entry in this table contains the index position of the ith object in the pack, where "ith object in the pack" is determined by pack offset. One thing that the on-disk format does _not_ contain is the full (up to) eight-byte offset corresponding to each object. This is something that the in-memory revindex contains (it stores an off_t in 'struct revindex_entry' along with the same uint32_t that the on-disk format has). Omit it in the on-disk format, since knowing the index position for some object is sufficient to get a constant-time lookup in the pack-*.idx file to ask for an object's offset within the pack. This trades off between the on-disk size of the 'pack-*.rev' file for runtime to chase down the offset for some object. Even though the lookup is constant time, the constant is heavier, since it can potentially involve two pointer walks in v2 indexes (one to access the 4-byte offset table, and potentially a second to access the double wide offset table). Consider trying to map an object's pack offset to a relative position within that pack. In a cold-cache scenario, more page faults occur while switching between binary searching through the reverse index and searching through the *.idx file for an object's offset. Sure enough, with a cold cache (writing '3' into '/proc/sys/vm/drop_caches' after 'sync'ing), printing out the entire object's contents is still marginally faster than printing its size: Benchmark #1: git.compile cat-file --batch-check="%(objectsize:disk)" <obj >/dev/null Time (mean ± σ): 22.6 ms ± 0.5 ms [User: 2.4 ms, System: 7.9 ms] Range (min … max): 21.4 ms … 23.5 ms 41 runs Benchmark #2: git.compile cat-file --batch <obj >/dev/null Time (mean ± σ): 17.2 ms ± 0.7 ms [User: 2.8 ms, System: 5.5 ms] Range (min … max): 15.6 ms … 18.2 ms 45 runs (Numbers taken in the kernel after cheating and using the next patch to generate a reverse index). There are a couple of approaches to improve cold cache performance not pursued here: - We could include the object offsets in the reverse index format. Predictably, this does result in fewer page faults, but it triples the size of the file, while simultaneously duplicating a ton of data already available in the .idx file. (This was the original way I implemented the format, and it did show `--batch-check='%(objectsize:disk)'` winning out against `--batch`.) On the other hand, this increase in size also results in a large block-cache footprint, which could potentially hurt other workloads. - We could store the mapping from pack to index position in more cache-friendly way, like constructing a binary search tree from the table and writing the values in breadth-first order. This would result in much better locality, but the price you pay is trading O(1) lookup in 'pack_pos_to_index()' for an O(log n) one (since you can no longer directly index the table). So, neither of these approaches are taken here. (Thankfully, the format is versioned, so we are free to pursue these in the future.) But, cold cache performance likely isn't interesting outside of one-off cases like asking for the size of an object directly. In real-world usage, Git is often performing many operations in the revindex (i.e., asking about many objects rather than a single one). The trade-off is worth it, since we will avoid the vast majority of the cost of generating the revindex that the extra pointer chase will look like noise in the following patch's benchmarks. This patch describes the format and prepares callers (like in pack-revindex.c) to be able to read *.rev files once they exist. An implementation of the writer will appear in the next patch, and callers will gradually begin to start using the writer in the patches that follow after that. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-06sha1-file: introduce no-lazy-fetch has_object()Jonathan Tan
There have been a few bugs wherein Git fetches missing objects whenever the existence of an object is checked, even though it does not need to perform such a fetch. To resolve these bugs, we could look at all the places that has_object_file() (or a similar function) is used. As a first step, introduce a new function has_object() that checks for the existence of an object, with a default behavior of not fetching if the object is missing and the repository is a partial clone. As we verify each has_object_file() (or similar) usage, we can replace it with has_object(), and we will know that we are done when we can delete has_object_file() (and the other similar functions). Also, the new function has_object() has more appropriate defaults: besides not fetching, it also does not recheck packed storage. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-27packfile: compute and use the index CRC offsetbrian m. carlson
Both v2 pack index files and the v3 format specified as part of the NewHash work have similar data starting at the CRC table. Much of the existing code wants to read either this table or the offset entries following it, and in doing so computes the offset each time. In order to share as much code between v2 and v3, compute the offset of the CRC table and store it when the pack is opened. Use this value to compute offsets to not only the CRC table, but to the offset entries beyond it. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-30oid_array: rename source file from sha1-arrayJeff King
We renamed the actual data structure in 910650d2f8 (Rename sha1_array to oid_array, 2017-03-31), but the file is still called sha1-array. Besides being slightly confusing, it makes it more annoying to grep for leftover occurrences of "sha1" in various files, because the header is included in so many places. Let's complete the transition by renaming the source and header files (and fixing up a few comment references). I kept the "-" in the name, as that seems to be our style; cf. fc1395f4a4 (sha1_file.c: rename to use dash in file name, 2018-04-10). We also have oidmap.h and oidset.h without any punctuation, but those are "struct oidmap" and "struct oidset" in the code. We _could_ make this "oidarray" to match, but somehow it looks uglier to me because of the length of "array" (plus it would be a very invasive patch for little gain). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-24packed_object_info(): use object_id for returning delta baseJeff King
If a caller sets the object_info.delta_base_sha1 to a non-NULL pointer, we'll write the oid of the object's delta base to it. But we can increase our type safety by switching this to a real object_id struct. All of our callers are just pointing into the hash member of an object_id anyway, so there's no inconvenience. Note that we do still keep it as a pointer-to-struct, because the NULL sentinel value tells us whether the caller is even interested in the information. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-14Merge branch 'mt/use-passed-repo-more-in-funcs'Junio C Hamano
Some codepaths were given a repository instance as a parameter to work in the repository, but passed the_repository instance to its callees, which has been cleaned up (somewhat). * mt/use-passed-repo-more-in-funcs: sha1-file: allow check_object_signature() to handle any repo sha1-file: pass git_hash_algo to hash_object_file() sha1-file: pass git_hash_algo to write_object_file_prepare() streaming: allow open_istream() to handle any repo pack-check: use given repo's hash_algo at verify_packfile() cache-tree: use given repo's hash_algo at verify_one() diff: make diff_populate_filespec() honor its repo argument
2020-02-14Merge branch 'mt/threaded-grep-in-object-store'Junio C Hamano
Traditionally, we avoided threaded grep while searching in objects (as opposed to files in the working tree) as accesses to the object layer is not thread-safe. This limitation is getting lifted. * mt/threaded-grep-in-object-store: grep: use no. of cores as the default no. of threads grep: move driver pre-load out of critical section grep: re-enable threads in non-worktree case grep: protect packed_git [re-]initialization grep: allow submodule functions to run in parallel submodule-config: add skip_if_read option to repo_read_gitmodules() grep: replace grep_read_mutex by internal obj read lock object-store: allow threaded access to object reading replace-object: make replace operations thread-safe grep: fix racy calls in grep_objects() grep: fix race conditions at grep_submodule() grep: fix race conditions on userdiff calls
2020-02-12Merge branch 'jn/pretend-object-doc'Junio C Hamano
Warn programmers about pretend_object_file() that allows the code to tentatively use in-core objects. * jn/pretend-object-doc: sha1-file: document how to use pretend_object_file
2020-01-31sha1-file: pass git_hash_algo to hash_object_file()Matheus Tavares
Allow hash_object_file() to work on arbitrary repos by introducing a git_hash_algo parameter. Change callers which have a struct repository pointer in their scope to pass on the git_hash_algo from the said repo. For all other callers, pass on the_hash_algo, which was already being used internally at hash_object_file(). This functionality will be used in the following patch to make check_object_signature() be able to work on arbitrary repos (which, in turn, will be used to fix an inconsistency at object.c:parse_object()). Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-22Merge branch 'jt/sha1-file-remove-oi-skip-cached'Junio C Hamano
has_object_file() said "no" given an object registered to the system via pretend_object_file(), making it inconsistent with read_object_file(), causing lazy fetch to attempt fetching an empty tree from promisor remotes. * jt/sha1-file-remove-oi-skip-cached: sha1-file: remove OBJECT_INFO_SKIP_CACHED
2020-01-17object-store: allow threaded access to object readingMatheus Tavares
Allow object reading to be performed by multiple threads protecting it with an internal lock, the obj_read_mutex. The lock usage can be toggled with enable_obj_read_lock() and disable_obj_read_lock(). Currently, the functions which can be safely called in parallel are: read_object_file_extended(), repo_read_object_file(), read_object_file(), read_object_with_reference(), read_object(), oid_object_info() and oid_object_info_extended(). It's also possible to use obj_read_lock() and obj_read_unlock() to protect other sections that cannot execute in parallel with object reading. Probably there are many spots in the functions listed above that could be executed unlocked (and thus, in parallel). But, for now, we are most interested in allowing parallel access to zlib inflation. This is one of the sections where object reading spends most of the time in (e.g. up to one-third of git-grep's execution time in the chromium repo corresponds to inflation) and it's already thread-safe. So, to take advantage of that, the obj_read_mutex is released when calling git_inflate() and re-acquired right after, for every calling spot in oid_object_info_extended()'s call chain. We may refine this lock to also exploit other possible parallel spots in the future, but for now, threaded zlib inflation should already give great speedups for threaded object reading callers. Note that add_delta_base_cache() was also modified to skip adding already present entries to the cache. This wasn't possible before, but it would be now, with the parallel inflation. Take for example the following situation, where two threads - A and B - are executing the code at unpack_entry(): 1. Thread A is performing the decompression of a base O (which is not yet in the cache) at PHASE II. Thread B is simultaneously trying to unpack O, but just starting at PHASE I. 2. Since O is not yet in the cache, B will go to PHASE II to also perform the decompression. 3. When they finish decompressing, one of them will get the object reading mutex and go to PHASE III while the other waits for the mutex. Let’s say A got the mutex first. 4. Thread A will add O to the cache, go throughout the rest of PHASE III and return. 5. Thread B gets the mutex, also add O to the cache (if the check wasn't there) and returns. Finally, it is also important to highlight that the object reading lock can only ensure thread-safety in the mentioned functions thanks to two complementary mechanisms: the use of 'struct raw_object_store's replace_mutex, which guards sections in the object reading machinery that would otherwise be thread-unsafe; and the 'struct pack_window's inuse_cnt, which protects window reading operations (such as the one performed during the inflation of a packed object), allowing them to execute without the acquisition of the obj_read_mutex. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-17replace-object: make replace operations thread-safeMatheus Tavares
replace-object functions are very close to being thread-safe: the only current racy section is the lazy initialization at prepare_replace_object(). The following patches will protect some object reading operations to be called threaded, but before that, replace functions must be protected. To do so, add a mutex to struct raw_object_store and acquire it before lazy initializing the replace_map. This won't cause any noticeable performance drop as the mutex will no longer be used after the replace_map is initialized. Later, when the replace functions are called in parallel, thread debuggers might point our use of the added replace_map_initialized flag as a data race. However, as this boolean variable is initialized as false and it's only updated once, there's no real harm. It's perfectly fine if the value is updated right after a thread read it in replace-map.h:lookup_replace_object() (there'll only be a performance penalty for the affected threads at that moment). We could cease the debugger warning protecting the variable reading at the said function. However, this would negatively affect performance for all threads calling it, at any time, so it's not really worthy since the warning doesn't represent a real problem. Instead, to make sure we don't get false positives (at ThreadSanitizer, at least) an entry for the respective function is added to .tsan-suppressions. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-06sha1-file: document how to use pretend_object_fileJonathan Nieder
Like in-memory alternates, pretend_object_file contains a trap for the unwary: careless callers can use it to create references to an object that does not exist in the on-disk object store. Add a comment documenting how to use the function without risking such problems. The only current caller is blame, which uses pretend_object_file to create an in-memory commit representing the working tree state. Noticed during a discussion of how to safely use this function in operations like "git merge" which, unlike blame, are not read-only. Inspired-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-02sha1-file: remove OBJECT_INFO_SKIP_CACHEDJonathan Tan
In a partial clone, if a user provides the hash of the empty tree ("git mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which requires that that object be parsed, for example: git diff-tree 4b825d <a non-empty tree> then Git will lazily fetch the empty tree, unnecessarily, because parsing of that object invokes repo_has_object_file(), which does not special-case the empty tree. Instead, teach repo_has_object_file() to consult find_cached_object() (which handles the empty tree), thus bringing it in line with the rest of the object-store-accessing functions. A cost is that repo_has_object_file() will now need to oideq upon each invocation, but that is trivial compared to the filesystem lookup or the pack index search required anyway. (And if find_cached_object() needs to do more because of previous invocations to pretend_object_file(), all the more reason to be consistent in whether we present cached objects.) As a historical note, the function now known as repo_read_object_file() was taught the empty tree in 346245a1bb ("hard-code the empty tree object", 2008-02-13), and the function now known as oid_object_info() was taught the empty tree in c4d9986f5f ("sha1_object_info: examine cached_object store too", 2011-02-07). repo_has_object_file() was never updated, perhaps due to oversight. The flag OBJECT_INFO_SKIP_CACHED, introduced later in dfdd4afcf9 ("sha1_file: teach sha1_object_info_extended more flags", 2017-06-26) and used in e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26), was introduced to preserve this difference in empty-tree handling, but now it can be removed. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-03packfile.c: speed up loading lots of packfilesColin Stolley
When loading packfiles on start-up, we traverse the internal packfile list once per file to avoid reloading packfiles that have already been loaded. This check runs in quadratic time, so for poorly maintained repos with a large number of packfiles, it can be pretty slow. Add a hashmap containing the packfile names as we load them so that the average runtime cost of checking for already-loaded packs becomes constant. Add a perf test to p5303 to show speed-up. The existing p5303 test runtimes are dominated by other factors and do not show an appreciable speed-up. The new test in p5303 clearly exposes a speed-up in bad cases. In this test we create 10,000 packfiles and measure the start-up time of git rev-parse, which does little else besides load in the packs. Here are the numbers for the new p5303 test: Test HEAD^ HEAD --------------------------------------------------------------------- 5303.12: load 10,000 packs 1.03(0.92+0.10) 0.12(0.02+0.09) -88.3% Signed-off-by: Colin Stolley <cstolley@runbox.com> Helped-by: Jeff King <peff@peff.net> [jc: squashed the change to call hashmap in install_packed_git() by peff] Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-07-19Merge branch 'jk/check-connected-with-alternates'Junio C Hamano
The tips of refs from the alternate object store can be used as starting point for reachability computation now. * jk/check-connected-with-alternates: check_everything_connected: assume alternate ref tips are valid object-store.h: move for_each_alternate_ref() from transport.h
2019-07-01object-store.h: move for_each_alternate_ref() from transport.hJeff King
There's nothing inherently transport-related about enumerating the alternate ref tips. The code has lived in transport.[ch] because the only use so far had been advertising available tips during transport. But it could be used for more, and a future patch will teach rev-list to access these refs. Let's move it alongside the other alt-odb code, declaring it in object-store.h with the implementation in sha1-file.c. This lets us drop the inclusion of transport.h from receive-pack, which perhaps shows how it was misplaced (though receive-pack is about transporting objects, transport.h is mostly about the client side). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-17Merge branch 'ds/object-info-for-prefetch-fix'Junio C Hamano
Code cleanup and futureproof. * ds/object-info-for-prefetch-fix: sha1-file: split OBJECT_INFO_FOR_PREFETCH
2019-05-28sha1-file: split OBJECT_INFO_FOR_PREFETCHDerrick Stolee
The OBJECT_INFO_FOR_PREFETCH bitflag was added to sha1-file.c in 0f4a4fb1 (sha1-file: support OBJECT_INFO_FOR_PREFETCH, 2019-03-29) and is used to prevent the fetch_objects() method when enabled. However, there is a problem with the current use. The definition of OBJECT_INFO_FOR_PREFETCH is given by adding 32 to OBJECT_INFO_QUICK. This is clearly stated above the definition (in a comment) that this is so OBJECT_INFO_FOR_PREFETCH implies OBJECT_INFO_QUICK. The problem is that using "flag & OBJECT_INFO_FOR_PREFETCH" means that OBJECT_INFO_QUICK also implies OBJECT_INFO_FOR_PREFETCH. Split out the single bit from OBJECT_INFO_FOR_PREFETCH into a new OBJECT_INFO_SKIP_FETCH_OBJECT as the single bit and keep OBJECT_INFO_FOR_PREFETCH as the union of two flags. This allows a clearer use of flag checking while also keeping the implication of OBJECT_INFO_QUICK. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-19Merge branch 'ds/midx-too-many-packs'Junio C Hamano
The code to generate the multi-pack idx file was not prepared to see too many packfiles and ran out of open file descriptor, which has been corrected. * ds/midx-too-many-packs: midx: add packs to packed_git linked list midx: pass a repository pointer
2019-05-13Merge branch 'dl/no-extern-in-func-decl'Junio C Hamano
Mechanically and systematically drop "extern" from function declarlation. * dl/no-extern-in-func-decl: *.[ch]: manually align parameter lists *.[ch]: remove extern from function declarations using sed *.[ch]: remove extern from function declarations using spatch
2019-05-07midx: add packs to packed_git linked listDerrick Stolee
The multi-pack-index allows searching for objects across multiple packs using one object list. The original design gains many of these performance benefits by keeping the packs in the multi-pack-index out of the packed_git list. Unfortunately, this has one major drawback. If the multi-pack-index covers thousands of packs, and a command loads many of those packs, then we can hit the limit for open file descriptors. The close_one_pack() method is used to limit this resource, but it only looks at the packed_git list, and uses an LRU cache to prevent thrashing. Instead of complicating this close_one_pack() logic to include direct references to the multi-pack-index, simply add the packs opened by the multi-pack-index to the packed_git list. This immediately solves the file-descriptor limit problem, but requires some extra steps to avoid performance issues or other problems: 1. Create a multi_pack_index bit in the packed_git struct that is one if and only if the pack was loaded from a multi-pack-index. 2. Skip packs with the multi_pack_index bit when doing object lookups and abbreviations. These algorithms already check the multi-pack-index before the packed_git struct. This has a very small performance hit, as we need to walk more packed_git structs. This is acceptable, since these operations run binary search on the other packs, so this walk-and-ignore logic is very fast by comparison. 3. When closing a multi-pack-index file, do not close its packs, as those packs will be closed using close_all_packs(). In some cases, such as 'git repack', we run 'close_midx()' without also closing the packs, so we need to un-set the multi_pack_index bit in those packs. This is necessary, and caught by running t6501-freshen-objects.sh with GIT_TEST_MULTI_PACK_INDEX=1. To manually test this change, I inserted trace2 logging into close_pack_fd() and set pack_max_fds to 10, then ran 'git rev-list --all --objects' on a copy of the Git repo with 300+ pack-files and a multi-pack-index. The logs verified the packs are closed as we read them beyond the file descriptor limit. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-05*.[ch]: manually align parameter listsDenton Liu
In previous patches, extern was mechanically removed from function declarations without care to formatting, causing parameter lists to be misaligned. Manually format changed sections such that the parameter lists should be realigned. Viewing this patch with 'git diff -w' should produce no output. Signed-off-by: Denton Liu <liu.denton@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-05*.[ch]: remove extern from function declarations using spatchDenton Liu
There has been a push to remove extern from function declarations. Remove some instances of "extern" for function declarations which are caught by Coccinelle. Note that Coccinelle has some difficulty with processing functions with `__attribute__` or varargs so some `extern` declarations are left behind to be dealt with in a future patch. This was the Coccinelle patch used: @@ type T; identifier f; @@ - extern T f(...); and it was run with: $ git ls-files \*.{c,h} | grep -v ^compat/ | xargs spatch --sp-file contrib/coccinelle/noextern.cocci --in-place Files under `compat/` are intentionally excluded as some are directly copied from external sources and we should avoid churning them as much as possible. Signed-off-by: Denton Liu <liu.denton@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-25Merge branch 'jt/batch-fetch-blobs-in-diff'Junio C Hamano
While running "git diff" in a lazy clone, we can upfront know which missing blobs we will need, instead of waiting for the on-demand machinery to discover them one by one. Aim to achieve better performance by batching the request for these promised blobs. * jt/batch-fetch-blobs-in-diff: diff: batch fetching of missing blobs sha1-file: support OBJECT_INFO_FOR_PREFETCH
2019-04-01sha1-file: support OBJECT_INFO_FOR_PREFETCHJonathan Tan
Teach oid_object_info_extended() to support a new flag that inhibits fetching of missing objects. This is equivalent to setting fetch_is_missing to 0, calling oid_object_info_extended(), then setting fetch_if_missing to whatever it was before. Update unpack-trees.c to use this new flag instead of repeatedly setting fetch_if_missing. This new flag complicates things slightly in that there are now 2 ways to do the same thing. But this eliminates the need to repeatedly set a global variable, and more importantly, allows prefetching to be done in parallel (in the future); hence, this patch. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01object-store: rename and expand packed_git's sha1 memberbrian m. carlson
This member is used to represent the pack checksum of the pack in question. Expand this member to be GIT_MAX_RAWSZ bytes in length so it works with longer hashes and rename it to be "hash" instead of "sha1". This transformation was made with a change to the definition and the following semantic patch: @@ struct packed_git *E1; @@ - E1->sha1 + E1->hash @@ struct packed_git E1; @@ - E1.sha1 + E1.hash Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-02-07Merge branch 'jk/loose-object-cache-oid'Junio C Hamano
Code clean-up. * jk/loose-object-cache-oid: prefer "hash mismatch" to "sha1 mismatch" sha1-file: avoid "sha1 file" for generic use in messages sha1-file: prefer "loose object file" to "sha1 file" in messages sha1-file: drop has_sha1_file() convert has_sha1_file() callers to has_object_file() sha1-file: convert pass-through functions to object_id sha1-file: modernize loose header/stream functions sha1-file: modernize loose object file functions http: use struct object_id instead of bare sha1 update comment references to sha1_object_info() sha1-file: fix outdated sha1 comment references
2019-02-05Merge branch 'sb/more-repo-in-api'Junio C Hamano
The in-core repository instances are passed through more codepaths. * sb/more-repo-in-api: (23 commits) t/helper/test-repository: celebrate independence from the_repository path.h: make REPO_GIT_PATH_FUNC repository agnostic commit: prepare free_commit_buffer and release_commit_memory for any repo commit-graph: convert remaining functions to handle any repo submodule: don't add submodule as odb for push submodule: use submodule repos for object lookup pretty: prepare format_commit_message to handle arbitrary repositories commit: prepare logmsg_reencode to handle arbitrary repositories commit: prepare repo_unuse_commit_buffer to handle any repo commit: prepare get_commit_buffer to handle any repo commit-reach: prepare in_merge_bases[_many] to handle any repo commit-reach: prepare get_merge_bases to handle any repo commit-reach.c: allow get_merge_bases_many_0 to handle any repo commit-reach.c: allow remove_redundant to handle any repo commit-reach.c: allow merge_bases_many to handle any repo commit-reach.c: allow paint_down_to_common to handle any repo commit: allow parse_commit* to handle any repo object: parse_object to honor its repository argument object-store: prepare has_{sha1, object}_file to handle any repo object-store: prepare read_object_file to deal with any repo ...
2019-01-08sha1-file: drop has_sha1_file()Jeff King
There are no callers left of has_sha1_file() or its with_flags() variant. Let's drop them, and convert has_object_file() from a wrapper into the "real" function. Ironically, the sha1 variant was just copying into an object_id internally, so the resulting code is actually shorter! We can also drop the coccinelle rules for catching has_sha1_file() callers. Since the function no longer exists, the compiler will do that for us. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-08object-store: retire odb_load_loose_cache()René Scharfe
Inline odb_load_loose_cache() into its only remaining caller, odb_loose_cache(). The latter offers a nicer interface for loading the cache, as it doesn't require callers to deal with fanout directory numbers directly. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-08object-store: use one oid_array per subdirectory for loose cacheRené Scharfe
The loose objects cache is filled one subdirectory at a time as needed. It is stored in an oid_array, which has to be resorted after each add operation. So when querying a wide range of objects, the partially filled array needs to be resorted up to 255 times, which takes over 100 times longer than sorting once. Use one oid_array for each subdirectory. This ensures that entries have to only be sorted a single time. It also avoids eight binary search steps for each cache lookup as a small bonus. The cache is used for collision checks for the log placeholders %h, %t and %p, and we can see the change speeding them up in a repository with ca. 100 objects per subdirectory: $ git count-objects 26733 objects, 68808 kilobytes Test HEAD^ HEAD -------------------------------------------------------------------- 4205.1: log with %H 0.51(0.47+0.04) 0.51(0.49+0.02) +0.0% 4205.2: log with %h 0.84(0.82+0.02) 0.60(0.57+0.03) -28.6% 4205.3: log with %T 0.53(0.49+0.04) 0.52(0.48+0.03) -1.9% 4205.4: log with %t 0.84(0.80+0.04) 0.60(0.59+0.01) -28.6% 4205.5: log with %P 0.52(0.48+0.03) 0.51(0.50+0.01) -1.9% 4205.6: log with %p 0.85(0.78+0.06) 0.61(0.56+0.05) -28.2% 4205.7: log with %h-%h-%h 0.96(0.92+0.03) 0.69(0.64+0.04) -28.1% Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-08sha1-file: modernize loose object file functionsJeff King
The loose object access code in sha1-file.c is some of the oldest in Git, and could use some modernizing. It mostly uses "unsigned char *" for object ids, which these days should be "struct object_id". It also uses the term "sha1_file" in many functions, which is confusing. The term "loose_objects" is much better. It clearly distinguishes them from packed objects (which didn't even exist back when the name "sha1_file" came into being). And it also distinguishes it from the checksummed-file concept in csum-file.c (which until recently was actually called "struct sha1file"!). This patch converts the functions {open,close,map,stat}_sha1_file() into open_loose_object(), etc, and switches their sha1 arguments for object_id structs. Similarly, path functions like fill_sha1_path() become fill_loose_path() and use object_ids. The function sha1_loose_object_info() already says "loose", so we can just drop the "sha1" (and teach it to use object_id). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-08object-store: factor out odb_clear_loose_cache()René Scharfe
Add and use a function for emptying the loose object cache, so callers don't have to know any of its implementation details. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-08object-store: factor out odb_loose_cache()René Scharfe
Add and use a function for loading the entries of a loose object subdirectory for a given object ID. It frees callers from deriving the fanout key; they can use the returned oid_array reference for lookups or forward range scans. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-14object-store: prepare has_{sha1, object}_file to handle any repoStefan Beller
Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-14object-store: prepare read_object_file to deal with any repoStefan Beller
As read_object_file is a widely used function (which is also regularly used in new code in flight between master..pu), changing its signature is painful is hard, as other series in flight rely on the original signature. It would burden the maintainer if we'd just change the signature. Introduce repo_read_object_file which takes the repository argument, and hide the original read_object_file as a macro behind NO_THE_REPOSITORY_COMPATIBILITY_MACROS, similar to e675765235 (diff.c: remove implicit dependency on the_index, 2018-09-21) Add a coccinelle patch to convert existing callers, but do not apply the resulting patch to keep the diff of this patch small. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-14object-store: allow read_object_file_extended to read from any repoStefan Beller
read_object_file_extended is not widely used, so migrate it all at once. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-13sha1-file: use loose object cache for quick existence checkJeff King
In cases where we expect to ask has_sha1_file() about a lot of objects that we are not likely to have (e.g., during fetch negotiation), we already use OBJECT_INFO_QUICK to sacrifice accuracy (due to racing with a simultaneous write or repack) for speed (we avoid re-scanning the pack directory). However, even checking for loose objects can be expensive, as we will stat() each one. On many systems this cost isn't too noticeable, but stat() can be particularly slow on some operating systems, or due to network filesystems. Since the QUICK flag already tells us that we're OK with a slightly stale answer, we can use that as a cue to look in our in-memory cache of each object directory. That basically trades an in-memory binary search for a stat() call. Note that it is possible for this to actually be _slower_. We'll do a full readdir() to fill the cache, so if you have a very large number of loose objects and a very small number of lookups, that readdir() may end up more expensive. This shouldn't be a big deal in practice. If you have a large number of reachable loose objects, you'll already run into performance problems (which you should remedy by repacking). You may have unreachable objects which wouldn't otherwise impact performance. Usually these would go away with the prune step of "git gc", but they may be held for up to 2 weeks in the default configuration. So it comes down to how many such objects you might reasonably expect to have, how much slower is readdir() on N entries versus M stat() calls (and here we really care about the syscall backing readdir(), like getdents() on Linux, but I'll just call this readdir() below). If N is much smaller than M (a typical packed repo), we know this is a big win (few readdirs() followed by many uses of the resulting cache). When N and M are similar in size, it's also a win. We care about the latency of making a syscall, and readdir() should be giving us many values in a single call. How many? On Linux, running "strace -e getdents ls" shows a 32k buffer getting 512 entries per call (which is 64 bytes per entry; the name itself is 38 bytes, plus there are some other fields). So we can imagine that this is always a win as long as the number of loose objects in the repository is a factor of 500 less than the number of lookups you make. It's hard to auto-tune this because we don't generally know up front how many lookups we're going to do. But it's unlikely for this to perform significantly worse. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-13object-store: provide helpers for loose_objects_cacheJeff King
Our object_directory struct has a loose objects cache that all users of the struct can see. But the only one that knows how to load the cache is find_short_object_filename(). Let's extract that logic in to a reusable function. While we're at it, let's also reset the cache when we re-read the object directories. This shouldn't have an impact on performance, as re-reads are meant to be rare (and are already expensive, so we avoid them with things like OBJECT_INFO_QUICK). Since the cache is already meant to be an approximation, it's tempting to skip even this bit of safety. But it's necessary to allow more code to use it. For instance, fetch-pack explicitly re-reads the object directory after performing its fetch, and would be confused if we didn't clear the cache. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-13sha1-file: use an object_directory for the main object dirJeff King
Our handling of alternate object directories is needlessly different from the main object directory. As a result, many places in the code basically look like this: do_something(r->objects->objdir); for (odb = r->objects->alt_odb_list; odb; odb = odb->next) do_something(odb->path); That gets annoying when do_something() is non-trivial, and we've resorted to gross hacks like creating fake alternates (see find_short_object_filename()). Instead, let's give each raw_object_store a unified list of object_directory structs. The first will be the main store, and everything after is an alternate. Very few callers even care about the distinction, and can just loop over the whole list (and those who care can just treat the first element differently). A few observations: - we don't need r->objects->objectdir anymore, and can just mechanically convert that to r->objects->odb->path - object_directory's path field needs to become a real pointer rather than a FLEX_ARRAY, in order to fill it with expand_base_dir() - we'll call prepare_alt_odb() earlier in many functions (i.e., outside of the loop). This may result in us calling it even when our function would be satisfied looking only at the main odb. But this doesn't matter in practice. It's not a very expensive operation in the first place, and in the majority of cases it will be a noop. We call it already (and cache its results) in prepare_packed_git(), and we'll generally check packs before loose objects. So essentially every program is going to call it immediately once per program. Arguably we should just prepare_alt_odb() immediately upon setting up the repository's object directory, which would save us sprinkling calls throughout the code base (and forgetting to do so has been a source of subtle bugs in the past). But I've stopped short of that here, since there are already a lot of other moving parts in this patch. - Most call sites just get shorter. The check_and_freshen() functions are an exception, because they have entry points to handle local and nonlocal directories separately. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-13handle alternates paths the same as the main object dirJeff King
When we generate loose file paths for the main object directory, the caller provides a buffer to loose_object_path (formerly sha1_file_name). The callers generally keep their own static buffer to avoid excessive reallocations. But for alternate directories, each struct carries its own scratch buffer. This is needlessly different; let's unify them. We could go either direction here, but this patch moves the alternates struct over to the main directory style (rather than vice-versa). Technically the alternates style is more efficient, as it avoids rewriting the object directory name on each call. But this is unlikely to matter in practice, as we avoid reallocations either way (and nobody has ever noticed or complained that the main object directory is copying a few extra bytes before making a much more expensive system call). And this has the advantage that the reusable buffers are tied to particular calls, which makes the invalidation rules simpler (for example, the return value from stat_sha1_file() used to be invalidated by basically any other object call, but now it is affected only by other calls to stat_sha1_file()). We do steal the trick from alt_sha1_path() of returning a pointer to the filled buffer, which makes a few conversions more convenient. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-13sha1_file_name(): overwrite buffer instead of appendingJeff King
The sha1_file_name() function is used to generate the path to a loose object in the object directory. It doesn't make much sense for it to append, since the the path we write may be absolute (i.e., you cannot reliably build up a path with it). Because many callers use it with a static buffer, they have to strbuf_reset() manually before each call (and the other callers always use an empty buffer, so they don't care either way). Let's handle this automatically. Since we're changing the semantics, let's take the opportunity to give it a more hash-neutral name (which will also catch any callers from topics in flight). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-13rename "alternate_object_database" to "object_directory"Jeff King
In preparation for unifying the handling of alt odb's and the normal repo object directory, let's use a more neutral name. This patch is purely mechanical, swapping the type name, and converting any variables named "alt" to "odb". There should be no functional change, but it will reduce the noise in subsequent diffs. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20packfile: add all_packs listDerrick Stolee
If a repo contains a multi-pack-index, then the packed_git list does not contain the packfiles that are covered by the multi-pack-index. This is important for doing object lookups, abbreviations, and approximating object count. However, there are many operations that really want to iterate over all packfiles. Create a new 'all_packs' linked list that contains this list, starting with the packfiles in the multi-pack-index and then continuing along the packed_git linked list. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-20Sync 'ds/multi-pack-index' to v2.19.0-rc0Junio C Hamano
* ds/multi-pack-index: (23 commits) midx: clear midx on repack packfile: skip loading index if in multi-pack-index midx: prevent duplicate packfile loads midx: use midx in approximate_object_count midx: use existing midx when writing new one midx: use midx in abbreviation calculations midx: read objects from multi-pack-index config: create core.multiPackIndex setting midx: write object offsets midx: write object id fanout chunk midx: write object ids in a chunk midx: sort and deduplicate objects from packfiles midx: read pack names into array multi-pack-index: write pack names in chunk multi-pack-index: read packfile list packfile: generalize pack directory list t5319: expand test data multi-pack-index: load into memory midx: write header information to lockfile multi-pack-index: add 'write' verb ...