summaryrefslogtreecommitdiff
path: root/bulk-checkin.c
AgeCommit message (Collapse)Author
2022-07-29t5351: avoid relying on `core.fsyncMethod = batch` to be supportedJohannes Schindelin
On FreeBSD, this mode is not supported. But since 3a251bac0d1a (trace2: only include "fsync" events if we git_fsync(), 2022-07-18) t5351 will fail if this mode is unsupported. Let's address this in the minimal fashion, by detecting that that mode is unsupported and expecting a different count of hardware flushes in that case. This fixes the CI/PR builds on FreeBSD again. Note: A better way would be to test only what is relevant in t5351.6 "unpack big object in stream (core.fsyncmethod=batch)" again instead of blindly comparing the output against some exact text. But that would pretty much revert the idea of above-mentioned commit, and that commit was _just_ accepted into Git's main branch so one must assume that it was intentional. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-03Merge branch 'tb/cruft-packs'Junio C Hamano
A mechanism to pack unreachable objects into a "cruft pack", instead of ejecting them into loose form to be reclaimed later, has been introduced. * tb/cruft-packs: sha1-file.c: don't freshen cruft packs builtin/gc.c: conditionally avoid pruning objects via loose builtin/repack.c: add cruft packs to MIDX during geometric repack builtin/repack.c: use named flags for existing_packs builtin/repack.c: allow configuring cruft pack generation builtin/repack.c: support generating a cruft pack builtin/pack-objects.c: --cruft with expiration reachable: report precise timestamps from objects in cruft packs reachable: add options to add_unseen_recent_objects_to_traversal builtin/pack-objects.c: --cruft without expiration builtin/pack-objects.c: return from create_object_entry() t/helper: add 'pack-mtimes' test-tool pack-mtimes: support writing pack .mtimes files chunk-format.h: extract oid_version() pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' pack-mtimes: support reading .mtimes files Documentation/technical: add cruft-packs.txt
2022-06-03Merge branch 'ns/batch-fsync'Junio C Hamano
Introduce a filesystem-dependent mechanism to optimize the way the bits for many loose object files are ensured to hit the disk platter. * ns/batch-fsync: core.fsyncmethod: performance tests for batch mode t/perf: add iteration setup mechanism to perf-lib core.fsyncmethod: tests for batch mode test-lib-functions: add parsing helpers for ls-files and ls-tree core.fsync: use batch mode and sync loose objects by default on Windows unpack-objects: use the bulk-checkin infrastructure update-index: use the bulk-checkin infrastructure builtin/add: add ODB transaction around add_files_to_cache cache-tree: use ODB transaction around writing a tree core.fsyncmethod: batched disk flushes for loose-objects bulk-checkin: rebrand plug/unplug APIs as 'odb transactions' bulk-checkin: rename 'state' variable and separate 'plugged' boolean
2022-05-26pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'Taylor Blau
This structure will be used to communicate the per-object mtimes when writing a cruft pack. Here, we need the full packing_data structure because the mtime information is stored in an array there, not on the individual object_entry's themselves (to avoid paying the overhead in structure width for operations which do not generate a cruft pack). We haven't passed this information down before because one of the two callers (in bulk-checkin.c) does not have a packing_data structure at all. In that case (where no cruft pack will be generated), NULL is passed instead. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-06core.fsyncmethod: batched disk flushes for loose-objectsNeeraj Singh
When adding many objects to a repo with `core.fsync=loose-object`, the cost of fsync'ing each object file can become prohibitive. One major source of the cost of fsync is the implied flush of the hardware writeback cache within the disk drive. This commit introduces a new `core.fsyncMethod=batch` option that batches up hardware flushes. It hooks into the bulk-checkin odb-transaction functionality, takes advantage of tmp-objdir, and uses the writeout-only support code. When the new mode is enabled, we do the following for each new object: 1a. Create the object in a tmp-objdir. 2a. Issue a pagecache writeback request and wait for it to complete. At the end of the entire transaction when unplugging bulk checkin: 1b. Issue an fsync against a dummy file to flush the log and hardware writeback cache, which should by now have seen the tmp-objdir writes. 2b. Rename all of the tmp-objdir files to their final names. 3b. When updating the index and/or refs, we assume that Git will issue another fsync internal to that operation. This is not the default today, but the user now has the option of syncing the index and there is a separate patch series to implement syncing of refs. On a filesystem with a singular journal that is updated during name operations (e.g. create, link, rename, etc), such as NTFS, HFS+, or XFS we would expect the fsync to trigger a journal writeout so that this sequence is enough to ensure that the user's data is durable by the time the git command returns. This sequence also ensures that no object files appear in the main object store unless they are fsync-durable. Batch mode is only enabled if core.fsync includes loose-objects. If the legacy core.fsyncObjectFiles setting is enabled, but core.fsync does not include loose-objects, we will use file-by-file fsyncing. In step (1a) of the sequence, the tmp-objdir is created lazily to avoid work if no loose objects are ever added to the ODB. We use a tmp-objdir to maintain the invariant that no loose-objects are visible in the main ODB unless they are properly fsync-durable. This is important since future ODB operations that try to create an object with specific contents will silently drop the new data if an object with the target hash exists without checking that the loose-object contents match the hash. Only a full git-fsck would restore the ODB to a functional state where dataloss doesn't occur. In step (1b) of the sequence, we issue a fsync against a dummy file created specifically for the purpose. This method has a little higher cost than using one of the input object files, but makes adding new callers of this mechanism easier, since we don't need to figure out which object file is "last" or risk sharing violations by caching the fd of the last object file. _Performance numbers_: Linux - Hyper-V VM running Kernel 5.11 (Ubuntu 20.04) on a fast SSD. Mac - macOS 11.5.1 running on a Mac mini on a 1TB Apple SSD. Windows - Same host as Linux, a preview version of Windows 11. Adding 500 files to the repo with 'git add' Times reported in seconds. object file syncing | Linux | Mac | Windows --------------------|-------|-------|-------- disabled | 0.06 | 0.35 | 0.61 fsync | 1.88 | 11.18 | 2.47 batch | 0.15 | 0.41 | 1.53 Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-06bulk-checkin: rebrand plug/unplug APIs as 'odb transactions'Neeraj Singh
Make it clearer in the naming and documentation of the plug_bulk_checkin and unplug_bulk_checkin APIs that they can be thought of as a "transaction" to optimize operations on the object database. These transactions may be nested so that subsystems like the cache-tree writing code can optimize their operations without caring whether the top-level code has a transaction active. Add a flush_odb_transaction API that will be used in update-index to make objects visible even if a transaction is active. The flush call may also be useful in future cases if we hold a transaction active around calling hooks. Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-06bulk-checkin: rename 'state' variable and separate 'plugged' booleanNeeraj Singh
This commit prepares for adding batch-fsync to the bulk-checkin infrastructure. The bulk-checkin infrastructure is currently used to batch up addition of large blobs to a packfile. When a blob is larger than big_file_threshold, we unconditionally add it to a pack. If bulk checkins are 'plugged', we allow multiple large blobs to be added to a single pack until we reach the packfile size limit; otherwise, we simply make a new packfile for each large blob. The 'unplug' call tells us when the series of blob additions is done so that we can finish the packfiles and make their objects available to subsequent operations. Stated another way, bulk-checkin allows callers to define a transaction that adds multiple objects to the object database, where the object database can optimize its internal operations within the transaction boundary. Batched fsync will fit into bulk-checkin by taking advantage of the plug/unplug functionality to determine the appropriate time to fsync and make newly-added objects available in the primary object database. * Rename 'state' variable to 'bulk_checkin_packfile', since we will later be adding 'bulk_fsync_objdir'. This also makes the variable easier to find in the debugger, since the name is more unique. * Rename finish_bulk_checkin to flush_bulk_checkin_packfile and call it unconditionally from unplug_bulk_checkin. Internally it will conditionally do a flush if there's any work to do. * Move the 'plugged' data member of 'bulk_checkin_state' into a separate static variable. Doing this avoids resetting the variable in finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we seem to unintentionally disable the plugging functionality the first time a new packfile must be created due to packfile size limits. While disabling the plugging state only results in suboptimal behavior for the current code, it would be fatal for the bulk-fsync functionality later in this patch series. The net effect of these changes is to make a clear separation between the portion of the bulk-checkin infrastructure that is related to the packfile (nearly all of it at present) and the part that is related to other future optimizations of the ODB. Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-25Merge branch 'ns/core-fsyncmethod'Junio C Hamano
Replace core.fsyncObjectFiles with two new configuration variables, core.fsync and core.fsyncMethod. * ns/core-fsyncmethod: core.fsync: documentation and user-friendly aggregate options core.fsync: new option to harden the index core.fsync: add configuration parsing core.fsync: introduce granular fsync control infrastructure core.fsyncmethod: add writeout-only mode wrapper: make inclusion of Windows csprng header tightly scoped
2022-03-10core.fsync: introduce granular fsync control infrastructureNeeraj Singh
This commit introduces the infrastructure for the core.fsync configuration knob. The repository components we want to sync are identified by flags so that we can turn on or off syncing for specific components. If core.fsyncObjectFiles is set and the core.fsync configuration also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any loose objects. This picks the strictest data integrity behavior if core.fsync and core.fsyncObjectFiles are set to conflicting values. This change introduces the currently unused fsync_component helper, which will be used by a later patch that adds fsyncing to the refs backend. Actual configuration and documentation of the fsync components list are in other patches in the series to separate review of the underlying mechanism from the policy of how it's configured. Helped-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-26object-file API: add a format_object_header() functionÆvar Arnfjörð Bjarmason
Add a convenience function to wrap the xsnprintf() command that generates loose object headers. This code was copy/pasted in various parts of the codebase, let's define it in one place and re-use it from there. All except one caller of it had a valid "enum object_type" for us, it's only write_object_file_prepare() which might need to deal with "git hash-object --literally" and a potential garbage type. Let's have the primary API use an "enum object_type", and define a *_literally() function that can take an arbitrary "const char *" for the type. See [1] for the discussion that prompted this patch, i.e. new code in object-file.c that wanted to copy/paste the xsnprintf() invocation. In the case of fast-import.c the callers unfortunately need to cast back & forth between "unsigned char *" and "char *", since format_object_header() ad encode_in_pack_object_header() take different signedness. 1. https://lore.kernel.org/git/211213.86bl1l9bfz.gmgdl@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Jiang Xin <zhiyou.jx@alibaba-inc.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-10pack-write: split up finish_tmp_packfile() functionÆvar Arnfjörð Bjarmason
Split up the finish_tmp_packfile() function and use the split-up version in pack-objects.c in preparation for moving the step of renaming the *.idx file later as part of a function change. Since the only other caller of finish_tmp_packfile() was in bulk-checkin.c, and it won't be needing a change to its *.idx renaming, provide a thin wrapper for the old function as a static function in that file. If other callers end up needing the simpler version it could be moved back to "pack-write.c" and "pack.h". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-10pack-write: refactor renaming in finish_tmp_packfile()Ævar Arnfjörð Bjarmason
Refactor the renaming in finish_tmp_packfile() into a helper function. The callers are now expected to pass a "name_buffer" ending in "pack-OID." instead of the previous "pack-", we then append "pack", "idx" or "rev" to it. By doing the strbuf_setlen() in rename_tmp_packfile() we reuse the buffer and avoid the repeated allocations we'd get if that function had its own temporary "struct strbuf". This approach of reusing the buffer does make the last user in pack-object.c's write_pack_file() slightly awkward, since we needlessly do a strbuf_setlen() before calling strbuf_release() for consistency. In subsequent changes we'll move that bitmap writing code around, so let's not skip the strbuf_setlen() now. The previous strbuf_reset() idiom originated with 5889271114a (finish_tmp_packfile():use strbuf for pathname construction, 2014-03-03), which in turn was a minimal adjustment of pre-strbuf code added in 0e990530ae (finish_tmp_packfile(): a helper function, 2011-10-28). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-10bulk-checkin.c: store checksum directlyTaylor Blau
finish_bulk_checkin() stores the checksum from finalize_hashfile() by writing to the `hash` member of `struct object_id`, but that hash has nothing to do with an object id (it's just a convenient location that happens to be sized correctly). Store the hash directly in an unsigned char array. This behaves the same as writing to the `hash` member, but makes the intent clearer (and avoids allocating an extra four bytes for the `algo` member of `struct object_id`). It likewise prevents the possibility of a segfault when reading `algo` (e.g., by calling `oid_to_hex()`) if it is uninitialized. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-11bulk-checkin: make buffer reuse more obvious and saferAndrzej Hunt
ibuf can be reused for multiple iterations of the loop. Specifically: deflate() overwrites s.avail_in to show how much of the input buffer has not been processed yet - and sometimes leaves 'avail_in > 0', in which case ibuf will be processed again during the loop's subsequent iteration. But if we declare ibuf within the loop, then (in theory) we get a new (and uninitialised) buffer for every iteration. In practice, my compiler seems to resue the same buffer - meaning that this code does work - but it doesn't seem safe to rely on this behaviour. MSAN correctly catches this issue - as soon as we hit the 's.avail_in > 0' condition, we end up reading from what seems to be uninitialised memory. Therefore, we move ibuf out of the loop, making this reuse safe. See MSAN output from t1050-large below - the interesting part is the ibuf creation at the end, although there's a lot of indirection before we reach the read from unitialised memory: ==11294==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x7f75db58fb1c in crc32_little crc32.c:283:9 #1 0x7f75db58d5b3 in crc32_z crc32.c:220:20 #2 0x7f75db59668c in crc32 crc32.c:242:12 #3 0x8c94f8 in hashwrite csum-file.c:101:15 #4 0x825faf in stream_to_pack bulk-checkin.c:154:5 #5 0x82467b in deflate_to_pack bulk-checkin.c:225:8 #6 0x823ff1 in index_bulk_checkin bulk-checkin.c:264:15 #7 0xa7cff2 in index_stream object-file.c:2234:9 #8 0xa7bff7 in index_fd object-file.c:2256:9 #9 0xa7d22d in index_path object-file.c:2274:7 #10 0xb3c8c9 in add_to_index read-cache.c:802:7 #11 0xb3e039 in add_file_to_index read-cache.c:835:9 #12 0x4a99c3 in add_files add.c:458:7 #13 0x4a7276 in cmd_add add.c:670:18 #14 0x4a1e76 in run_builtin git.c:461:11 #15 0x49e1e7 in handle_builtin git.c:714:3 #16 0x4a0c08 in run_argv git.c:781:4 #17 0x49d5a8 in cmd_main git.c:912:19 #18 0x7974da in main common-main.c:52:11 #19 0x7f75da66f349 in __libc_start_main (/lib64/libc.so.6+0x24349) #20 0x421bd9 in _start start.S:120 Uninitialized value was stored to memory at #0 0x7f75db58fa6b in crc32_little crc32.c:283:9 #1 0x7f75db58d5b3 in crc32_z crc32.c:220:20 #2 0x7f75db59668c in crc32 crc32.c:242:12 #3 0x8c94f8 in hashwrite csum-file.c:101:15 #4 0x825faf in stream_to_pack bulk-checkin.c:154:5 #5 0x82467b in deflate_to_pack bulk-checkin.c:225:8 #6 0x823ff1 in index_bulk_checkin bulk-checkin.c:264:15 #7 0xa7cff2 in index_stream object-file.c:2234:9 #8 0xa7bff7 in index_fd object-file.c:2256:9 #9 0xa7d22d in index_path object-file.c:2274:7 #10 0xb3c8c9 in add_to_index read-cache.c:802:7 #11 0xb3e039 in add_file_to_index read-cache.c:835:9 #12 0x4a99c3 in add_files add.c:458:7 #13 0x4a7276 in cmd_add add.c:670:18 #14 0x4a1e76 in run_builtin git.c:461:11 #15 0x49e1e7 in handle_builtin git.c:714:3 #16 0x4a0c08 in run_argv git.c:781:4 #17 0x49d5a8 in cmd_main git.c:912:19 #18 0x7974da in main common-main.c:52:11 #19 0x7f75da66f349 in __libc_start_main (/lib64/libc.so.6+0x24349) Uninitialized value was stored to memory at #0 0x447eb9 in __msan_memcpy msan_interceptors.cpp:1558:3 #1 0x7f75db5c2011 in flush_pending deflate.c:746:5 #2 0x7f75db5cafa0 in deflate_stored deflate.c:1815:9 #3 0x7f75db5bb7d2 in deflate deflate.c:1005:34 #4 0xd80b7f in git_deflate zlib.c:244:12 #5 0x825dff in stream_to_pack bulk-checkin.c:140:12 #6 0x82467b in deflate_to_pack bulk-checkin.c:225:8 #7 0x823ff1 in index_bulk_checkin bulk-checkin.c:264:15 #8 0xa7cff2 in index_stream object-file.c:2234:9 #9 0xa7bff7 in index_fd object-file.c:2256:9 #10 0xa7d22d in index_path object-file.c:2274:7 #11 0xb3c8c9 in add_to_index read-cache.c:802:7 #12 0xb3e039 in add_file_to_index read-cache.c:835:9 #13 0x4a99c3 in add_files add.c:458:7 #14 0x4a7276 in cmd_add add.c:670:18 #15 0x4a1e76 in run_builtin git.c:461:11 #16 0x49e1e7 in handle_builtin git.c:714:3 #17 0x4a0c08 in run_argv git.c:781:4 #18 0x49d5a8 in cmd_main git.c:912:19 #19 0x7974da in main common-main.c:52:11 Uninitialized value was stored to memory at #0 0x447eb9 in __msan_memcpy msan_interceptors.cpp:1558:3 #1 0x7f75db644241 in _tr_stored_block trees.c:873:5 #2 0x7f75db5cad7c in deflate_stored deflate.c:1813:9 #3 0x7f75db5bb7d2 in deflate deflate.c:1005:34 #4 0xd80b7f in git_deflate zlib.c:244:12 #5 0x825dff in stream_to_pack bulk-checkin.c:140:12 #6 0x82467b in deflate_to_pack bulk-checkin.c:225:8 #7 0x823ff1 in index_bulk_checkin bulk-checkin.c:264:15 #8 0xa7cff2 in index_stream object-file.c:2234:9 #9 0xa7bff7 in index_fd object-file.c:2256:9 #10 0xa7d22d in index_path object-file.c:2274:7 #11 0xb3c8c9 in add_to_index read-cache.c:802:7 #12 0xb3e039 in add_file_to_index read-cache.c:835:9 #13 0x4a99c3 in add_files add.c:458:7 #14 0x4a7276 in cmd_add add.c:670:18 #15 0x4a1e76 in run_builtin git.c:461:11 #16 0x49e1e7 in handle_builtin git.c:714:3 #17 0x4a0c08 in run_argv git.c:781:4 #18 0x49d5a8 in cmd_main git.c:912:19 #19 0x7974da in main common-main.c:52:11 Uninitialized value was stored to memory at #0 0x447eb9 in __msan_memcpy msan_interceptors.cpp:1558:3 #1 0x7f75db5c8fcf in deflate_stored deflate.c:1783:9 #2 0x7f75db5bb7d2 in deflate deflate.c:1005:34 #3 0xd80b7f in git_deflate zlib.c:244:12 #4 0x825dff in stream_to_pack bulk-checkin.c:140:12 #5 0x82467b in deflate_to_pack bulk-checkin.c:225:8 #6 0x823ff1 in index_bulk_checkin bulk-checkin.c:264:15 #7 0xa7cff2 in index_stream object-file.c:2234:9 #8 0xa7bff7 in index_fd object-file.c:2256:9 #9 0xa7d22d in index_path object-file.c:2274:7 #10 0xb3c8c9 in add_to_index read-cache.c:802:7 #11 0xb3e039 in add_file_to_index read-cache.c:835:9 #12 0x4a99c3 in add_files add.c:458:7 #13 0x4a7276 in cmd_add add.c:670:18 #14 0x4a1e76 in run_builtin git.c:461:11 #15 0x49e1e7 in handle_builtin git.c:714:3 #16 0x4a0c08 in run_argv git.c:781:4 #17 0x49d5a8 in cmd_main git.c:912:19 #18 0x7974da in main common-main.c:52:11 #19 0x7f75da66f349 in __libc_start_main (/lib64/libc.so.6+0x24349) Uninitialized value was stored to memory at #0 0x447eb9 in __msan_memcpy msan_interceptors.cpp:1558:3 #1 0x7f75db5ea545 in read_buf deflate.c:1181:5 #2 0x7f75db5c97f7 in deflate_stored deflate.c:1791:9 #3 0x7f75db5bb7d2 in deflate deflate.c:1005:34 #4 0xd80b7f in git_deflate zlib.c:244:12 #5 0x825dff in stream_to_pack bulk-checkin.c:140:12 #6 0x82467b in deflate_to_pack bulk-checkin.c:225:8 #7 0x823ff1 in index_bulk_checkin bulk-checkin.c:264:15 #8 0xa7cff2 in index_stream object-file.c:2234:9 #9 0xa7bff7 in index_fd object-file.c:2256:9 #10 0xa7d22d in index_path object-file.c:2274:7 #11 0xb3c8c9 in add_to_index read-cache.c:802:7 #12 0xb3e039 in add_file_to_index read-cache.c:835:9 #13 0x4a99c3 in add_files add.c:458:7 #14 0x4a7276 in cmd_add add.c:670:18 #15 0x4a1e76 in run_builtin git.c:461:11 #16 0x49e1e7 in handle_builtin git.c:714:3 #17 0x4a0c08 in run_argv git.c:781:4 #18 0x49d5a8 in cmd_main git.c:912:19 #19 0x7974da in main common-main.c:52:11 Uninitialized value was created by an allocation of 'ibuf' in the stack frame of function 'stream_to_pack' #0 0x825710 in stream_to_pack bulk-checkin.c:101 SUMMARY: MemorySanitizer: use-of-uninitialized-value crc32.c:283:9 in crc32_little Exiting Signed-off-by: Andrzej Hunt <andrzej@ahunt.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-27Use the final_oid_fn to finalize hashing of object IDsbrian m. carlson
When we're hashing a value which is going to be an object ID, we want to zero-pad that value if necessary. To do so, use the final_oid_fn instead of the final_fn anytime we're going to create an object ID to ensure we perform this operation. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-14use CALLOC_ARRAYRené Scharfe
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-06bulk-checkin: zero-initialize hashfile_checkpointJeff King
We declare a "struct hashfile_checkpoint" but only sometimes actually call hashfile_checkpoint() on it. That makes it not immediately obvious that it's valid when we later access its members. In fact, the code is fine: we fill it in unconditionally in the while(1) loop as long as "idx" is non-NULL. And then if "idx" is NULL, we exit early from the function (because we're just computing the hash, not actually writing), before we look at the struct. However, this does seem to confuse gcc 9.2.1's -Wmaybe-uninitialized when compiled with "-flto -O2" (probably because with LTO it can now realize that our call to hashfile_truncate() does not set the members either). Let's zero-initialize the struct to tell the compiler, as well as any readers of the code, that all is well. Reported-by: Stephan Beyer <s-beyer@gmx.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-08convert has_sha1_file() callers to has_object_file()Jeff King
The only remaining callers of has_sha1_file() actually have an object_id already. They can use the "object" variant, rather than dereferencing the hash themselves. The code changes here were completely generated by the included coccinelle patch. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-29convert "oidcmp() == 0" to oideq()Jeff King
Using the more restrictive oideq() should, in the long run, give the compiler more opportunities to optimize these callsites. For now, this conversion should be a complete noop with respect to the generated code. The result is also perhaps a little more readable, as it avoids the "zero is equal" idiom. Since it's so prevalent in C, I think seasoned programmers tend not to even notice it anymore, but it can sometimes make for awkward double negations (e.g., we can drop a few !!oidcmp() instances here). This patch was generated almost entirely by the included coccinelle patch. This mechanical conversion should be completely safe, because we check explicitly for cases where oidcmp() is compared to 0, which is what oideq() is doing under the hood. Note that we don't have to catch "!oidcmp()" separately; coccinelle's standard isomorphisms make sure the two are treated equivalently. I say "almost" because I did hand-edit the coccinelle output to fix up a few style violations (it mostly keeps the original formatting, but sometimes unwraps long lines). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-18Merge branch 'sb/object-store-grafts'Junio C Hamano
The conversion to pass "the_repository" and then "a_repository" throughout the object access API continues. * sb/object-store-grafts: commit: allow lookup_commit_graft to handle arbitrary repositories commit: allow prepare_commit_graft to handle arbitrary repositories shallow: migrate shallow information into the object parser path.c: migrate global git_path_* to take a repository argument cache: convert get_graft_file to handle arbitrary repositories commit: convert read_graft_file to handle arbitrary repositories commit: convert register_commit_graft to handle arbitrary repositories commit: convert commit_graft_pos() to handle arbitrary repositories shallow: add repository argument to is_repository_shallow shallow: add repository argument to check_shallow_file_for_update shallow: add repository argument to register_shallow shallow: add repository argument to set_alternate_shallow_file commit: add repository argument to lookup_commit_graft commit: add repository argument to prepare_commit_graft commit: add repository argument to read_graft_file commit: add repository argument to register_commit_graft commit: add repository argument to commit_graft_pos object: move grafts to object parser object-store: move object access functions to object-store.h
2018-05-30Merge branch 'js/use-bug-macro'Junio C Hamano
Developer support update, by using BUG() macro instead of die() to mark codepaths that should not happen more clearly. * js/use-bug-macro: BUG_exit_code: fix sparse "symbol not declared" warning Convert remaining die*(BUG) messages Replace all die("BUG: ...") calls by BUG() ones run-command: use BUG() to report bugs, not die() test-tool: help verifying BUG() code paths
2018-05-16object-store: move object access functions to object-store.hStefan Beller
This should make these functions easier to find and cache.h less overwhelming to read. In particular, this moves: - read_object_file - oid_object_info - write_object_file As a result, most of the codebase needs to #include object-store.h. In this patch the #include is only added to files that would fail to compile otherwise. It would be better to #include wherever identifiers from the header are used. That can happen later when we have better tooling for it. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-08Merge branch 'ds/commit-graph'Junio C Hamano
Precompute and store information necessary for ancestry traversal in a separate file to optimize graph walking. * ds/commit-graph: commit-graph: implement "--append" option commit-graph: build graph from starting commits commit-graph: read only from specific pack-indexes commit: integrate commit graph with commit parsing commit-graph: close under reachability commit-graph: add core.commitGraph setting commit-graph: implement git commit-graph read commit-graph: implement git-commit-graph write commit-graph: implement write_commit_graph() commit-graph: create git-commit-graph builtin graph: add commit graph design document commit-graph: add format document csum-file: refactor finalize_hashfile() method csum-file: rename hashclose() to finalize_hashfile()
2018-05-06Replace all die("BUG: ...") calls by BUG() onesJohannes Schindelin
In d8193743e08 (usage.c: add BUG() function, 2017-05-12), a new macro was introduced to use for reporting bugs instead of die(). It was then subsequently used to convert one single caller in 588a538ae55 (setup_git_env: convert die("BUG") to BUG(), 2017-05-12). The cover letter of the patch series containing this patch (cf 20170513032414.mfrwabt4hovujde2@sigill.intra.peff.net) is not terribly clear why only one call site was converted, or what the plan is for other, similar calls to die() to report bugs. Let's just convert all remaining ones in one fell swoop. This trick was performed by this invocation: sed -i 's/die("BUG: /BUG("/g' $(git grep -l 'die("BUG' \*.c) Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11Merge branch 'sb/packfiles-in-repository'Junio C Hamano
Refactoring of the internal global data structure continues. * sb/packfiles-in-repository: packfile: keep prepare_packed_git() private packfile: allow find_pack_entry to handle arbitrary repositories packfile: add repository argument to find_pack_entry packfile: allow reprepare_packed_git to handle arbitrary repositories packfile: allow prepare_packed_git to handle arbitrary repositories packfile: allow prepare_packed_git_one to handle arbitrary repositories packfile: add repository argument to reprepare_packed_git packfile: add repository argument to prepare_packed_git packfile: add repository argument to prepare_packed_git_one packfile: allow install_packed_git to handle arbitrary repositories packfile: allow rearrange_packed_git to handle arbitrary repositories packfile: allow prepare_packed_git_mru to handle arbitrary repositories
2018-04-02csum-file: refactor finalize_hashfile() methodDerrick Stolee
If we want to use a hashfile on the temporary file for a lockfile, then we need finalize_hashfile() to fully write the trailing hash but also keep the file descriptor open. Do this by adding a new CSUM_HASH_IN_STREAM flag along with a functional change that checks this flag before writing the checksum to the stream. This differs from previous behavior since it would be written if either CSUM_CLOSE or CSUM_FSYNC is provided. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-02csum-file: rename hashclose() to finalize_hashfile()Derrick Stolee
The hashclose() method behaves very differently depending on the flags parameter. In particular, the file descriptor is not always closed. Perform a simple rename of "hashclose()" to "finalize_hashfile()" in preparation for functional changes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-26packfile: add repository argument to reprepare_packed_gitStefan Beller
See previous patch for explanation. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-14bulk-checkin: convert index_bulk_checkin to struct object_idbrian m. carlson
Convert the index_bulk_checkin function, and the static functions it calls, to use pointers to struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-06Merge branch 'bw/c-plus-plus'Junio C Hamano
Avoid using identifiers that clash with C++ keywords. Even though it is not a goal to compile Git with C++ compilers, changes like this help use of code analysis tools that targets C++ on our codebase. * bw/c-plus-plus: (37 commits) replace: rename 'new' variables trailer: rename 'template' variables tempfile: rename 'template' variables wrapper: rename 'template' variables environment: rename 'namespace' variables diff: rename 'template' variables environment: rename 'template' variables init-db: rename 'template' variables unpack-trees: rename 'new' variables trailer: rename 'new' variables submodule: rename 'new' variables split-index: rename 'new' variables remote: rename 'new' variables ref-filter: rename 'new' variables read-cache: rename 'new' variables line-log: rename 'new' variables imap-send: rename 'new' variables http: rename 'new' variables entry: rename 'new' variables diffcore-delta: rename 'new' variables ...
2018-02-14object: rename function 'typename' to 'type_name'Brandon Williams
Rename C++ keyword in order to bring the codebase closer to being able to be compiled with a C++ compiler. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-02-02bulk-checkin: abstract SHA-1 usagebrian m. carlson
Convert uses of the direct SHA-1 functions to use the_hash_algo instead. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-02-02csum-file: rename sha1file to hashfilebrian m. carlson
Rename struct sha1file to struct hashfile, along with all of its related functions. The transformation in this commit was made by global search-and-replace. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-27distinguish error versus short read from read_in_full()Jeff King
Many callers of read_in_full() expect to see the exact number of bytes requested, but their error handling lumps together true read errors and short reads due to unexpected EOF. We can give more specific error messages by separating these cases (showing errno when appropriate, and otherwise describing the short read). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-23pack: move {,re}prepare_packed_git and approximate_object_countJonathan Tan
Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-08pack: convert struct pack_idx_entry to struct object_idbrian m. carlson
Convert struct pack_idx_entry to use struct object_id by changing the definition and applying the following semantic patch, plus the standard object_id transforms: @@ struct pack_idx_entry E1; @@ - E1.sha1 + E1.oid.hash @@ struct pack_idx_entry *E1; @@ - E1->sha1 + E1->oid.hash Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-24encode_in_pack_object_header: respect output buffer lengthJeff King
The encode_in_pack_object_header() writes a variable-length header to an output buffer, but it doesn't actually know long the buffer is. At first glance, this looks like it might be possible to overflow. In practice, this is probably impossible. The smallest buffer we use is 10 bytes, which would hold the header for an object up to 2^67 bytes. Obviously we're not likely to see such an object, but we might worry that an object could lie about its size (causing us to overflow before we realize it does not actually have that many bytes). But the argument is passed as a uintmax_t. Even on systems that have __int128 available, uintmax_t is typically restricted to 64-bit by the ABI. So it's unlikely that a system exists where this could be exploited. Still, it's easy enough to use a normal out/len pair and make sure we don't write too far. That protects the hypothetical 128-bit system, makes it harder for callers to accidentally specify a too-small buffer, and makes the resulting code easier to audit. Note that the one caller in fast-import tried to catch such a case, but did so _after_ the call (at which point we'd have already overflowed!). This check can now go away. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-11-16compression: unify pack.compression configuration parsingJunio C Hamano
There are three codepaths that use a variable whose name is pack_compression_level to affect how objects and deltas sent to a packfile is compressed. Unlike zlib_compression_level that controls the loose object compression, however, this variable was static to each of these codepaths. Two of them read the pack.compression configuration variable, using core.compression as the default, and one of them also allowed overriding it from the command line. The other codepath in bulk-checkin did not pay any attention to the configuration. Unify the configuration parsing to git_default_config(), where we implement the parsing of core.loosecompression and core.compression and make the former override the latter, by moving code to parse pack.compression and also allow core.compression to give default to this variable. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-09-25use xsnprintf for generating git object headersJeff King
We generally use 32-byte buffers to format git's "type size" header fields. These should not generally overflow unless you can produce some truly gigantic objects (and our types come from our internal array of constant strings). But it is a good idea to use xsnprintf to make sure this is the case. Note that we slightly modify the interface to write_sha1_file_prepare, which nows uses "hdrlen" as an "in" parameter as well as an "out" (on the way in it stores the allocated size of the header, and on the way out it returns the ultimate size of the header). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-05-06Merge branch 'bc/object-id'Junio C Hamano
Identify parts of the code that knows that we use SHA-1 hash to name our objects too much, and use (1) symbolic constants instead of hardcoded 20 as byte count and/or (2) use struct object_id instead of unsigned char [20] for object names. * bc/object-id: apply: convert threeway_stage to object_id patch-id: convert to use struct object_id commit: convert parts to struct object_id diff: convert struct combine_diff_path to object_id bulk-checkin.c: convert to use struct object_id zip: use GIT_SHA1_HEXSZ for trailers archive.c: convert to use struct object_id bisect.c: convert leaf functions to use struct object_id define utility functions for object IDs define a structure for object IDs
2015-03-17Merge branch 'rs/deflate-init-cleanup'Junio C Hamano
Code simplification. * rs/deflate-init-cleanup: zlib: initialize git_zstream in git_deflate_init{,_gzip,_raw}
2015-03-14bulk-checkin.c: convert to use struct object_idbrian m. carlson
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-03-05zlib: initialize git_zstream in git_deflate_init{,_gzip,_raw}René Scharfe
Clear the git_zstream variable at the start of git_deflate_init() etc. so that callers don't have to do that. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-09-15cleanups: ensure that git-compat-util.h is included firstDavid Aguilar
CodingGuidelines states that the first #include in C files should be git-compat-util.h or another header file that includes it, such as cache.h or builtin.h. Signed-off-by: David Aguilar <davvid@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-03finish_tmp_packfile():use strbuf for pathname constructionSun He
The old version fixes a maximum length on the buffer, which could be a problem if one is not certain of the length of get_object_directory(). Using strbuf can avoid the protential bug. Helped-by: Michael Haggerty <mhagger@alum.mit.edu> Helped-by: Eric Sunshine <sunshine@sunshineco.com> Signed-off-by: Sun He <sunheehnus@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-20stream_to_pack: xread does not guarantee to read all requested bytesJohannes Sixt
The deflate loop in bulk-checkin::stream_to_pack expects to get all bytes from a file that it requests to read in a single function call. But it used xread(), which does not give that guarantee. Replace it by read_in_full(). Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-12-01bulk-checkin: replace fast-import based implementationJunio C Hamano
This extends the earlier approach to stream a large file directly from the filesystem to its own packfile, and allows "git add" to send large files directly into a single pack. Older code used to spawn fast-import, but the new bulk-checkin API replaces it. Signed-off-by: Junio C Hamano <gitster@pobox.com>