summaryrefslogtreecommitdiff
path: root/builtin/index-pack.c
AgeCommit message (Collapse)Author
26 hoursMerge branch 'ab/unbundle-progress'Junio C Hamano
Add progress display to "git bundle unbundle". * ab/unbundle-progress: bundle: show progress on "unbundle" index-pack: add --progress-title option bundle API: change "flags" to be "extra_index_pack_args" bundle API: start writing API documentation
26 hoursMerge branch 'tb/pack-finalize-ordering'Junio C Hamano
The order in which various files that make up a single (conceptual) packfile has been reevaluated and straightened up. This matters in correctness, as an incomplete set of files must not be shown to a running Git. * tb/pack-finalize-ordering: pack-objects: rename .idx files into place after .bitmap files pack-write: split up finish_tmp_packfile() function builtin/index-pack.c: move `.idx` files into place last index-pack: refactor renaming in final() builtin/repack.c: move `.idx` files into place last pack-write.c: rename `.idx` files after `*.rev` pack-write: refactor renaming in finish_tmp_packfile() bulk-checkin.c: store checksum directly pack.h: line-wrap the definition of finish_tmp_packfile()
12 daysbuiltin/index-pack.c: move `.idx` files into place lastTaylor Blau
In a similar spirit as preceding patches to `git repack` and `git pack-objects`, fix the identical problem in `git index-pack`. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
12 daysindex-pack: refactor renaming in final()Ævar Arnfjörð Bjarmason
Refactor the renaming in final() into a helper function, this is similar in spirit to a preceding refactoring of finish_tmp_packfile() in pack-write.c. Before e37d0b8730b (builtin/index-pack.c: write reverse indexes, 2021-01-25) it probably wasn't worth it to have this sort of helper, due to the differing "else if" case for "pack" files v.s. "idx" files. But since we've got "rev" as well now, let's do the renaming via a helper, this is both a net decrease in lines, and improves the readability, since we can easily see at a glance that the logic for writing these three types of files is exactly the same, aside from the obviously differing cases of "*final_name" being NULL, and "make_read_only_if_same" being different. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-07index-pack: add --progress-title optionÆvar Arnfjörð Bjarmason
Add a --progress-title option to index-pack, when data is piped into index-pack its progress is a proxy for whatever's feeding it data. This option will allow us to set a more relevant progress bar title in "git bundle unbundle", and is also used in my "bundle-uri" RFC patches[1] by a new caller in fetch-pack.c. The code change in cmd_index_pack() won't handle "--progress-title=xyz", only "--progress-title xyz", and the "(i+1)" style (as opposed to "i + 1") is a bit odd. Not using the "--long-option=value" style is inconsistent with existing long options handled by cmd_index_pack(), but makes the code that needs to call it better (two strvec_push(), instead of needing a strvec_pushf()). Since the option is internal-only the inconsistency shouldn't matter. I'm copying the pattern to handle it as-is from the handling of the existing "-o" option in the same function, see 9cf6d3357aa (Add git-index-pack utility, 2005-10-12) for its addition. That's a short option, but the code to implement the two is the same in functionality and style. Eventually we'd like to migrate all of this this to parse_options(), which would make these differences in behavior go away. 1. https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-25use xopen() to handle fatal open(2) failuresRené Scharfe
Add and apply a semantic patch for using xopen() instead of calling open(2) and die() or die_errno() explicitly. This makes the error messages more consistent and shortens the code. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-12*.c static functions: don't forward-declare __attribute__Ævar Arnfjörð Bjarmason
9cf6d3357aa (Add git-index-pack utility, 2005-10-12) and 466dbc42f58 (receive-pack: Send internal errors over side-band #2, 2010-02-10) we added these static functions and forward-declared their __attribute__((printf)). I think this may have been to work around some compiler limitation at the time, but in any case we have a lot of code that uses the briefer way of declaring these that I'm using here, so if we had any such issues with compilers we'd have seen them already. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-27Use the final_oid_fn to finalize hashing of object IDsbrian m. carlson
When we're hashing a value which is going to be an object ID, we want to zero-pad that value if necessary. To do so, use the final_oid_fn instead of the final_fn anytime we're going to create an object ID to ensure we perform this operation. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-27Always use oidread to read into struct object_idbrian m. carlson
In the future, we'll want oidread to automatically set the hash algorithm member for an object ID we read into it, so ensure we use oidread instead of hashcpy everywhere we're copying a hash value into a struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-07Merge branch 'ab/fsck-api-cleanup'Junio C Hamano
Fsck API clean-up. * ab/fsck-api-cleanup: fetch-pack: use new fsck API to printing dangling submodules fetch-pack: use file-scope static struct for fsck_options fetch-pack: don't needlessly copy fsck_options fsck.c: move gitmodules_{found,done} into fsck_options fsck.c: add an fsck_set_msg_type() API that takes enums fsck.c: pass along the fsck_msg_id in the fsck_error callback fsck.[ch]: move FOREACH_FSCK_MSG_ID & fsck_msg_id from *.c to *.h fsck.c: give "FOREACH_MSG_ID" a more specific name fsck.c: undefine temporary STR macro after use fsck.c: call parse_msg_type() early in fsck_set_msg_type() fsck.h: re-order and re-assign "enum fsck_msg_type" fsck.h: move FSCK_{FATAL,INFO,ERROR,WARN,IGNORE} into an enum fsck.c: refactor fsck_msg_type() to limit scope of "int msg_type" fsck.c: rename remaining fsck_msg_id "id" to "msg_id" fsck.c: remove (mostly) redundant append_msg_id() function fsck.c: rename variables in fsck_set_msg_type() for less confusion fsck.h: use "enum object_type" instead of "int" fsck.h: use designed initializers for FSCK_OPTIONS_{DEFAULT,STRICT} fsck.c: refactor and rename common config callback
2021-03-29fetch-pack: use new fsck API to printing dangling submodulesÆvar Arnfjörð Bjarmason
Refactor the check added in 5476e1efde (fetch-pack: print and use dangling .gitmodules, 2021-02-22) to make use of us now passing the "msg_id" to the user defined "error_func". We can now compare against the FSCK_MSG_GITMODULES_MISSING instead of parsing the generated message. Let's also replace register_found_gitmodules() with directly manipulating the "gitmodules_found" member. A recent commit moved it into "fsck_options" so we could do this here. I'm sticking this callback in fsck.c. Perhaps in the future we'd like to accumulate such callbacks into another file (maybe fsck-cb.c, similar to parse-options-cb.c?), but while we've got just the one let's just put it into fsck.c. A better alternative in this case would be some library some more obvious library shared by fetch-pack.c ad builtin/index-pack.c, but there isn't such a thing. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-29fetch-pack: don't needlessly copy fsck_optionsÆvar Arnfjörð Bjarmason
Change the behavior of the .gitmodules validation added in 5476e1efde (fetch-pack: print and use dangling .gitmodules, 2021-02-22) so we're using one "fsck_options". I found that code confusing to read. One might think that not setting up the error_func earlier means that we're relying on the "error_func" not being set in some code in between the two hunks being modified here. But we're not, all we're doing in the rest of "cmd_index_pack()" is further setup by calling fsck_set_msg_types(), and assigning to do_fsck_object. So there was no reason in 5476e1efde to make a shallow copy of the fsck_options struct before setting error_func. Let's just do this setup at the top of the function, along with the "walk" assignment. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-29fsck.c: pass along the fsck_msg_id in the fsck_error callbackÆvar Arnfjörð Bjarmason
Change the fsck_error callback to also pass along the fsck_msg_id. Before this change the only way to get the message id was to parse it back out of the "message". Let's pass it down explicitly for the benefit of callers that might want to use it, as discussed in [1]. Passing the msg_type is now redundant, as you can always get it back from the msg_id, but I'm not changing that convention. It's really common to need the msg_type, and the report() function itself (which calls "fsck_error") needs to call fsck_msg_type() to discover it. Let's not needlessly re-do that work in the user callback. 1. https://lore.kernel.org/git/87blcja2ha.fsf@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-29fsck.h: move FSCK_{FATAL,INFO,ERROR,WARN,IGNORE} into an enumÆvar Arnfjörð Bjarmason
Move the FSCK_{FATAL,INFO,ERROR,WARN,IGNORE} defines into a new fsck_msg_type enum. These defines were originally introduced in: - ba002f3b28a (builtin-fsck: move common object checking code to fsck.c, 2008-02-25) - f50c4407305 (fsck: disallow demoting grave fsck errors to warnings, 2015-06-22) - efaba7cc77f (fsck: optionally ignore specific fsck issues completely, 2015-06-22) - f27d05b1704 (fsck: allow upgrading fsck warnings to errors, 2015-06-22) The reason these were defined in two different places is because we use FSCK_{IGNORE,INFO,FATAL} only in fsck.c, but FSCK_{ERROR,WARN} are used by external callbacks. Untangling that would take some more work, since we expose the new "enum fsck_msg_type" to both. Similar to "enum object_type" it's not worth structuring the API in such a way that only those who need FSCK_{ERROR,WARN} pass around a different type. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-29fsck.h: use "enum object_type" instead of "int"Ævar Arnfjörð Bjarmason
Change the fsck_walk_func to use an "enum object_type" instead of an "int" type. The types are compatible, and ever since this was added in 355885d5315 (add generic, type aware object chain walker, 2008-02-25) we've used entries from object_type (OBJ_BLOB etc.). So this doesn't really change anything as far as the generated code is concerned, it just gives the compiler more information and makes this easier to read. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-14use CALLOC_ARRAYRené Scharfe
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-01Merge branch 'jt/transfer-fsck-across-packs'Junio C Hamano
The approach to "fsck" the incoming objects in "index-pack" is attractive for performance reasons (we have them already in core, inflated and ready to be inspected), but fundamentally cannot be applied fully when we receive more than one pack stream, as a tree object in one pack may refer to a blob object in another pack as ".gitmodules", when we want to inspect blobs that are used as ".gitmodules" file, for example. Teach "index-pack" to emit objects that must be inspected later and check them in the calling "fetch-pack" process. * jt/transfer-fsck-across-packs: fetch-pack: print and use dangling .gitmodules fetch-pack: with packfile URIs, use index-pack arg http-fetch: allow custom index-pack args http: allow custom index-pack args
2021-02-22fetch-pack: print and use dangling .gitmodulesJonathan Tan
Teach index-pack to print dangling .gitmodules links after its "keep" or "pack" line instead of declaring an error, and teach fetch-pack to check such lines printed. This allows the tree side of the .gitmodules link to be in one packfile and the blob side to be in another without failing the fsck check, because it is now fetch-pack which checks such objects after all packfiles have been downloaded and indexed (and not index-pack on an individual packfile, as it is before this commit). Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-26t: support GIT_TEST_WRITE_REV_INDEXTaylor Blau
Add a new option that unconditionally enables the pack.writeReverseIndex setting in order to run the whole test suite in a mode that generates on-disk reverse indexes. Additionally, enable this mode in the second run of tests under linux-gcc in 'ci/run-build-and-tests.sh'. Once on-disk reverse indexes are proven out over several releases, we can change the default value of that configuration to 'true', and drop this patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-26builtin/index-pack.c: write reverse indexesTaylor Blau
Teach 'git index-pack' to optionally write and verify reverse index with '--[no-]rev-index', as well as respecting the 'pack.writeReverseIndex' configuration option. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-26builtin/index-pack.c: allow stripping arbitrary extensionsTaylor Blau
To derive the filename for a .idx file, 'git index-pack' uses derive_filename() to strip the '.pack' suffix and add the new suffix. Prepare for stripping off suffixes other than '.pack' by making the suffix to strip a parameter of derive_filename(). In order to make this consistent with the "suffix" parameter which does not begin with a ".", an additional check in derive_filename. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-04object-file.c: rename from sha1-file.cMartin Ågren
Drop the last remnant of "sha1" in this file and rename it to reflect that we're not just able to handle SHA-1 these days. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-16compute pack .idx byte offsets using size_tJeff King
A pack and its matching .idx file are limited to 2^32 objects, because the pack format contains a 32-bit field to store the number of objects. Hence we use uint32_t in the code. But the byte count of even a .idx file can be much larger than that, because it stores at least a hash and an offset for each object. So using SHA-1, a v2 .idx file will cross the 4GB boundary at 153,391,650 objects. This confuses load_idx(), which computes the minimum size like this: unsigned long min_size = 8 + 4*256 + nr*(hashsz + 4 + 4) + hashsz + hashsz; Even though min_size will be big enough on most 64-bit platforms, the actual arithmetic is done as a uint32_t, resulting in a truncation. We actually exceed that min_size, but then we do: unsigned long max_size = min_size; if (nr) max_size += (nr - 1)*8; to account for the variable-sized table. That computation doesn't overflow quite so low, but with the truncation for min_size, we end up with a max_size that is much smaller than our actual size. So we complain that the idx is invalid, and can't find any of its objects. We can fix this case by casting "nr" to a size_t, which will do the multiplication in 64-bits (assuming you're on a 64-bit platform; this will never work on a 32-bit system since we couldn't map the whole .idx anyway). Likewise, we don't have to worry about further additions, because adding a smaller number to a size_t will convert the other side to a size_t. A few notes: - obviously we could just declare "nr" as a size_t in the first place (and likewise, packed_git.num_objects). But it's conceptually a uint32_t because of the on-disk format, and we correctly treat it that way in other contexts that don't need to compute byte offsets (e.g., iterating over the set of objects should and generally does use a uint32_t). Switching to size_t would make all of those other cases look wrong. - it could be argued that the proper type is off_t to represent the file offset. But in practice the .idx file must fit within memory, because we mmap the whole thing. And the rest of the code (including the idx_size variable we're comparing against) uses size_t. - we'll add the same cast to the max_size arithmetic line. Even though we're adding to a larger type, which will convert our result, the multiplication is still done as a 32-bit value and can itself overflow. I didn't check this with my test case, since it would need an even larger pack (~530M objects), but looking at compiler output shows that it works this way. The standard should agree, but I couldn't find anything explicit in 6.3.1.8 ("usual arithmetic conversions"). The case in load_idx() was the most immediate one that I was able to trigger. After fixing it, looking up actual objects (including the very last one in sha1 order) works in a test repo with 153,725,110 objects. That's because bsearch_hash() works with uint32_t entry indices, and the actual byte access: int cmp = hashcmp(table + mi * stride, sha1); is done with "stride" as a size_t, causing the uint32_t "mi" to be promoted to a size_t. This is the way most code will access the index data. However, I audited all of the other byte-wise accesses of packed_git.index_data, and many of the others are suspect (they are similar to the max_size one, where we are adding to a properly sized offset or directly to a pointer, but the multiplication in the sub-expression can overflow). I didn't trigger any of these in practice, but I believe they're potential problems, and certainly adding in the cast is not going to hurt anything here. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-09Merge branch 'jk/index-pack-hotfixes'Junio C Hamano
Hotfix and clean-up for the jt/threaded-index-pack topic that has graduated to v2.29-rc0. * jk/index-pack-hotfixes: index-pack: make get_base_data() comment clearer index-pack: drop type_cas mutex index-pack: restore "resolving deltas" progress meter
2020-10-07index-pack: make get_base_data() comment clearerJonathan Tan
A comment mentions that we may free cached delta bases via find_unresolved_deltas(), but that function went away in f08cbf60fe (index-pack: make quantum of work smaller, 2020-09-08). Since we need to rewrite that comment anyway, make the entire comment clearer. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-07index-pack: drop type_cas mutexJeff King
The type_cas lock lost all of its callers in f08cbf60fe (index-pack: make quantum of work smaller, 2020-09-08), so we can safely delete it. The compiler didn't alert us that the variable became unused, because we still call pthread_mutex_init() and pthread_mutex_destroy() on it. It's worth considering also whether that commit was in error to remove the use of the lock. Why don't we need it now, if we did before, as described in ab791dd138 (index-pack: fix race condition with duplicate bases, 2014-08-29)? I think the answer is that we now look at and assign the child_obj->real_type field in the main thread while holding the work_lock(). So we don't have to worry about racing with the worker threads. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-07index-pack: restore "resolving deltas" progress meterJeff King
Commit f08cbf60fe (index-pack: make quantum of work smaller, 2020-09-08) refactored the main loop in threaded_second_pass(), but also deleted the call to display_progress() at the top of the loop. This means that users typically see no progress at all during the delta resolution phase (and for large repositories, Git appears to hang). This looks like an accident that was unrelated to the intended change of that commit, since we continue to update nr_resolved_deltas in resolve_delta(). Let's restore the call to get that progress back. We'll also add a test that confirms we generate the expected progress. This isn't perfect, as it wouldn't catch a bug where progress was delayed to the end. That was probably possible to trigger when receiving a thin pack, because we'd eventually call display_progress() from fix_unresolved_deltas(), but only once after doing all the work. However, since our test case generates a complete pack, it reliably demonstrates this particular bug and its fix. And we can't do better without making the test racy. Signed-off-by: Jeff King <peff@peff.net> Acked-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-22Merge branch 'jt/threaded-index-pack'Junio C Hamano
"git index-pack" learned to resolve deltified objects with greater parallelism. * jt/threaded-index-pack: index-pack: make quantum of work smaller index-pack: make resolve_delta() assume base data index-pack: calculate {ref,ofs}_{first,last} early index-pack: remove redundant child field index-pack: unify threaded and unthreaded code index-pack: remove redundant parameter Documentation: deltaBaseCacheLimit is per-thread
2020-09-08index-pack: make quantum of work smallerJonathan Tan
Currently, when index-pack resolves deltas, it does not split up delta trees into threads: each delta base root (an object that is not a REF_DELTA or OFS_DELTA) can go into its own thread, but all deltas on that root (direct or indirect) are processed in the same thread. This is a problem when a repository contains a large text file (thus, delta-able) that is modified many times - delta resolution time during fetching is dominated by processing the deltas corresponding to that text file. This patch contains a solution to that. When cloning using git -c core.deltabasecachelimit=1g clone \ https://fuchsia.googlesource.com/third_party/vulkan-cts on my laptop, clone time improved from 3m2s to 2m5s (using 3 threads, which is the default). The solution is to have a global work stack. This stack contains delta bases (objects, whether appearing directly in the packfile or generated by delta resolution, that themselves have delta children) that need to be processed; whenever a thread needs work, it peeks at the top of the stack and processes its next unprocessed child. If a thread finds the stack empty, it will look for more delta base roots to push on the stack instead. The main weakness of having a global work stack is that more time is spent in the mutex, but profiling has shown that most time is spent in the resolution of the deltas themselves, so this shouldn't be an issue in practice. In any case, experimentation (as described in the clone command above) shows that this patch is a net improvement. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-24index-pack: make resolve_delta() assume base dataJonathan Tan
A subsequent commit will make the quantum of work smaller, necessitating more locking. This commit allows resolve_delta() to be called outside the lock. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-24index-pack: calculate {ref,ofs}_{first,last} earlyJonathan Tan
This is refactoring 2 of 2 to simplify struct base_data. Whenever we make a struct base_data, immediately calculate its delta children. This eliminates confusion as to when the {ref,ofs}_{first,last} fields are initialized. Before this patch, the delta children were calculated at the last possible moment. This allowed the members of struct base_data to be populated in any order, superficially useful when we have the object contents before the struct object_entry. But this makes reasoning about the state of struct base_data more complicated, hence this patch. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-24index-pack: remove redundant child fieldJonathan Tan
This is refactoring 1 of 2 to simplify struct base_data. In index-pack, each thread maintains a doubly-linked list of the delta chain that it is currently processing (the "base" and "child" pointers in struct base_data). When a thread exceeds the delta base cache limit and needs to reclaim memory, it uses the "child" pointers to traverse the lineage, reclaiming the memory of the eldest delta bases first. A subsequent patch will perform memory reclaiming in a different way and will thus no longer need the "child" pointer. Because the "child" pointer is redundant even now, remove it so that the aforementioned subsequent patch will be clearer. In the meantime, reclaim memory in the reverse order of the "base" pointers. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-24index-pack: unify threaded and unthreaded codeJonathan Tan
Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-24index-pack: remove redundant parameterJonathan Tan
find_{ref,ofs}_delta_{,children} take an enum object_type parameter, but the object type is already present in the name of the function. Remove that parameter from these functions. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-21index-pack: adjust default threading capJeff King
Commit b8a2486f15 (index-pack: support multithreaded delta resolving, 2012-05-06) describes an experiment that shows that setting the number of threads for index-pack higher than 3 does not help. I repeated that experiment using a more modern version of Git and a more modern CPU and got different results. Here are timings for p5302 against linux.git run on my laptop, a Core i9-9880H with 8 cores plus hyperthreading (so online-cpus returns 16): 5302.3: index-pack 0 threads 256.28(253.41+2.79) 5302.4: index-pack 1 threads 257.03(254.03+2.91) 5302.5: index-pack 2 threads 149.39(268.34+3.06) 5302.6: index-pack 4 threads 94.96(294.10+3.23) 5302.7: index-pack 8 threads 68.12(339.26+3.89) 5302.8: index-pack 16 threads 70.90(655.03+7.21) 5302.9: index-pack default number of threads 116.91(290.05+3.21) You can see that wall-clock times continue to improve dramatically up to the number of cores, but bumping beyond that (into hyperthreading territory) does not help (and in fact hurts a little). Here's the same experiment on a machine with dual Xeon 6230's, totaling 40 cores (80 with hyperthreading): 5302.3: index-pack 0 threads 310.04(302.73+6.90) 5302.4: index-pack 1 threads 310.55(302.68+7.40) 5302.5: index-pack 2 threads 178.17(304.89+8.20) 5302.6: index-pack 5 threads 99.53(315.54+9.56) 5302.7: index-pack 10 threads 72.80(327.37+12.79) 5302.8: index-pack 20 threads 60.68(357.74+21.66) 5302.9: index-pack 40 threads 58.07(454.44+67.96) 5302.10: index-pack 80 threads 59.81(720.45+334.52) 5302.11: index-pack default number of threads 134.18(309.32+7.98) The results are similar; things stop improving at 40 threads. Curiously, going from 20 to 40 really doesn't help much, either (and increases CPU time considerably). So that may represent an actual barrier to parallelism, where we lose out due to context-switching and loss of cache locality, but don't reap the wall-clock benefits due to contention of our coarse-grained locks. So what's a good default value? It's clear that the current cap of 3 is too low; our default values are 42% and 57% slower than the best times on each machine. The results on the 40-core machine imply that 20 threads is an actual barrier regardless of the number of cores, so we'll take that as a maximum. We get the best results on these machines at half of the online-cpus value. That's presumably a result of the hyperthreading. That's common on multi-core Intel processors, but not necessarily elsewhere. But if we take it as an assumption, we can perform optimally on hyperthreaded machines and still do much better than the status quo on other machines, as long as we never half below the current value of 3. So that's what this patch does. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-19builtin/index-pack: add option to specify hash algorithmbrian m. carlson
git index-pack is usually run in a repository, but need not be. Since packs don't contains information on the algorithm in use, instead relying on context, add an option to index-pack to tell it which one we're using in case someone runs it outside of a repository. Since using --stdin necessarily implies a repository, don't allow specifying an object format if it's provided to prevent users from passing an option that won't work. Add documentation for this option. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-27packfile: compute and use the index CRC offsetbrian m. carlson
Both v2 pack index files and the v3 format specified as part of the NewHash work have similar data starting at the CRC table. Much of the existing code wants to read either this table or the offset entries following it, and in doing so computes the offset each time. In order to share as much code between v2 and v3, compute the offset of the CRC table and store it when the pack is opened. Use this value to compute offsets to not only the CRC table, but to the offset entries beyond it. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-02promisor-remote: accept 0 as oid_nr in functionJonathan Tan
There are 3 callers to promisor_remote_get_direct() that first check if the number of objects to be fetched is equal to 0. Fold that check into promisor_remote_get_direct(), and in doing so, be explicit as to what promisor_remote_get_direct() does if oid_nr is 0 (it returns 0, success, immediately). Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-14Merge branch 'jk/index-pack-dupfix'Junio C Hamano
The index-pack code now diagnoses a bad input packstream that records the same object twice when it is used as delta base; the code used to declare a software bug when encountering such an input, but it is an input error. * jk/index-pack-dupfix: index-pack: downgrade twice-resolved REF_DELTA to die()
2020-02-04index-pack: downgrade twice-resolved REF_DELTA to die()Jeff King
When we're resolving a REF_DELTA, we compare-and-swap its type from REF_DELTA to whatever real type the base object has, as discussed in ab791dd138 (index-pack: fix race condition with duplicate bases, 2014-08-29). If the old type wasn't a REF_DELTA, we consider that a BUG(). But as discussed in that commit, we might see this case whenever we try to resolve an object twice, which may happen because we have multiple copies of the base object. So this isn't a bug at all, but rather a sign that the input pack is broken. And indeed, this case is triggered already in t5309.5 and t5309.6, which create packs with delta cycles and duplicate bases. But we never noticed because those tests are marked expect_failure. Those tests were added by b2ef3d9ebb (test index-pack on packs with recoverable delta cycles, 2013-08-23), which was leaving the door open for cases that we theoretically _could_ handle. And when we see an already-resolved object like this, in theory we could keep going after confirming that the previously resolved child->real_type matches base->obj->real_type. But: - enforcing the "only resolve once" rule here saves us from an infinite loop in other parts of the code. If we keep going, then the delta cycle in t5309.5 causes us to loop infinitely, as find_ref_delta_children() doesn't realize which objects have already been resolved. So there would be more changes needed to make this case work, and in the meantime we'd be worse off. - any pack that triggers this is broken anyway. It either has a duplicate base object, or it has a cycle which causes us to bring in a duplicate via --fix-thin. In either case, we'd end up rejecting the pack in write_idx_file(), which also detects duplicates. So the tests have little value in documenting what we _could_ be doing (and have been neglected for 6+ years). Let's switch them to confirming that we handle this case cleanly (and switch out the BUG() for a more informative die() so that we do so). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-31sha1-file: allow check_object_signature() to handle any repoMatheus Tavares
Some callers of check_object_signature() can work on arbitrary repositories, but the repo does not get passed to this function. Instead, the_repository is always used internally. To fix possible inconsistencies, allow the function to receive a struct repository and make those callers pass on the repo being handled. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-31sha1-file: pass git_hash_algo to hash_object_file()Matheus Tavares
Allow hash_object_file() to work on arbitrary repos by introducing a git_hash_algo parameter. Change callers which have a struct repository pointer in their scope to pass on the git_hash_algo from the said repo. For all other callers, pass on the_hash_algo, which was already being used internally at hash_object_file(). This functionality will be used in the following patch to make check_object_signature() be able to work on arbitrary repos (which, in turn, will be used to fix an inconsistency at object.c:parse_object()). Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-31streaming: allow open_istream() to handle any repoMatheus Tavares
Some callers of open_istream() at archive-tar.c and archive-zip.c are capable of working on arbitrary repositories but the repo struct is not passed down to open_istream(), which uses the_repository internally. For now, that's not a problem since the said callers are only being called with the_repository. But to be consistent and avoid future problems, let's allow open_istream() to receive a struct repository and use that instead of the_repository. This parameter addition will also be used in a future patch to make sha1-file.c:check_object_signature() be able to work on arbitrary repos. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-11Merge branch 'bc/object-id-part17'Junio C Hamano
Preparation for SHA-256 upgrade continues. * bc/object-id-part17: (26 commits) midx: switch to using the_hash_algo builtin/show-index: replace sha1_to_hex rerere: replace sha1_to_hex builtin/receive-pack: replace sha1_to_hex builtin/index-pack: replace sha1_to_hex packfile: replace sha1_to_hex wt-status: convert struct wt_status to object_id cache: remove null_sha1 builtin/worktree: switch null_sha1 to null_oid builtin/repack: write object IDs of the proper length pack-write: use hash_to_hex when writing checksums sequencer: convert to use the_hash_algo bisect: switch to using the_hash_algo sha1-lookup: switch hard-coded constants to the_hash_algo config: use the_hash_algo in abbrev comparison combine-diff: replace GIT_SHA1_HEXSZ with the_hash_algo bundle: switch to use the_hash_algo connected: switch GIT_SHA1_HEXSZ to the_hash_algo show-index: switch hard-coded constants to the_hash_algo blame: remove needless comparison with GIT_SHA1_HEXSZ ...
2019-08-19builtin/index-pack: replace sha1_to_hexbrian m. carlson
Since sha1_to_hex is limited to SHA-1, replace it with hash_to_hex so this code works with other algorithms. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-25Use promisor_remote_get_direct() and has_promisor_remote()Christian Couder
Instead of using the repository_format_partial_clone global and fetch_objects() directly, let's use has_promisor_remote() and promisor_remote_get_direct(). This way all the configured promisor remotes will be taken into account, not only the one specified by extensions.partialClone. Also when cloning or fetching using a partial clone filter, remote.origin.promisor will be set to "true" instead of setting extensions.partialClone to "origin". This makes it possible to use many promisor remote just by fetching from them. Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-15index-pack: prefetch missing REF_DELTA basesJonathan Tan
When fetching, the client sends "have" commit IDs indicating that the server does not need to send any object referenced by those commits, reducing network I/O. When the client is a partial clone, the client still sends "have"s in this way, even if it does not have every object referenced by a commit it sent as "have". If a server omits such an object, it is fine: the client could lazily fetch that object before this fetch, and it can still do so after. The issue is when the server sends a thin pack containing an object that is a REF_DELTA against such a missing object: index-pack fails to fix the thin pack. When support for lazily fetching missing objects was added in 8b4c0103a9 ("sha1_file: support lazily fetching missing objects", 2017-12-08), support in index-pack was turned off in the belief that it accesses the repo only to do hash collision checks. However, this is not true: it also needs to access the repo to resolve REF_DELTA bases. Support for lazy fetching should still generally be turned off in index-pack because it is used as part of the lazy fetching process itself (if not, infinite loops may occur), but we do need to fetch the REF_DELTA bases. (When fetching REF_DELTA bases, it is unlikely that those are REF_DELTA themselves, because we do not send "have" when making such fetches.) To resolve this, prefetch all missing REF_DELTA bases before attempting to resolve them. This both ensures that all bases are attempted to be fetched, and ensures that we make only one request per index-pack invocation, and not one request per missing object. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01index-pack: show progress while checking objectsSZEDER Gábor
When 'git index-pack' is run by 'git clone', its check_objects() function usually doesn't take long enough to be a concern, but I just run into a situation where it took about a minute or so: I inadvertently put some memory pressure on my tiny laptop while cloning linux.git, and then there was quite a long silence between the "Resolving deltas" and "Checking connectivity" progress bars. Show a progress bar during the loop of check_objects() to let the user know that something is still going on. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-08convert has_sha1_file() callers to has_object_file()Jeff King
The only remaining callers of has_sha1_file() actually have an object_id already. They can use the "object" variant, rather than dereferencing the hash themselves. The code changes here were completely generated by the included coccinelle patch. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-19Merge branch 'tb/print-size-t-with-uintmax-format'Junio C Hamano
Code preparation to replace ulong vars with size_t vars where appropriate. * tb/print-size-t-with-uintmax-format: Upcast size_t variables to uintmax_t when printing