summaryrefslogtreecommitdiff
path: root/csum-file.c
AgeCommit message (Collapse)Author
2023-09-13Merge branch 'ew/hash-with-openssl-evp'Junio C Hamano
Fix-up new-ish code to support OpenSSL EVP API. * ew/hash-with-openssl-evp: treewide: fix various bugs w/ OpenSSL 3+ EVP API
2023-09-01treewide: fix various bugs w/ OpenSSL 3+ EVP APIEric Wong
The OpenSSL 3+ EVP API for SHA-* cannot support our prior use cases supported by other SHA-* implementations. It has the following differences: 1. ->init_fn is required before all use 2. struct assignments don't work and requires ->clone_fn 3. can't support ->update_fn after ->final_*fn While fixing cases 1 and 2 is merely the matter of calling ->init_fn and ->clone_fn as appropriate, fixing case 3 requires calling ->final_*fn on a temporary context that's cloned from the primary context. Reported-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/ZPCL11k38PXTkFga@debian.me/ Helped-by: brian m. carlson <sandals@crustytoothpaste.net> Fixes: 3e440ea0aba0 ("sha256: avoid functions deprecated in OpenSSL 3+") Fixes: bda9c12073e7 ("avoid SHA-1 functions deprecated in OpenSSL 3+") Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05treewide: remove unnecessary includes for wrapper.hCalvin Wan
Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-24hash-ll.h: split out of hash.h to remove dependency on repository.hElijah Newren
hash.h depends upon and includes repository.h, due to the definition and use of the_hash_algo (defined as the_repository->hash_algo). However, most headers trying to include hash.h are only interested in the layout of the structs like object_id. Move the parts of hash.h that do not depend upon repository.h into a new file hash-ll.h (the "low level" parts of hash.h), and adjust other files to use this new header where the convenience inline functions aren't needed. This allows hash.h and object.h to be fairly small, minimal headers. It also exposes a lot of hidden dependencies on both path.h (which was brought in by repository.h) and repository.h (which was previously implicitly brought in by object.h), so also adjust other files to be more explicit about what they depend upon. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21treewide: remove cache.h inclusion due to environment.h changesElijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21wrapper.h: move declarations for wrapper.c functions from cache.hElijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-01-06hashfile: allow skipping the hash functionDerrick Stolee
The hashfile API is useful for generating files that include a trailing hash of the file's contents up to that point. Using such a hash is helpful for verifying the file for corruption-at-rest, such as a faulty drive causing flipped bits. Git's index file includes this trailing hash, so it uses a 'struct hashfile' to handle the I/O to the file. This was very convenient to allow using the hashfile methods during these operations. However, hashing the file contents during write comes at a performance penalty. It's slower to hash the bytes on their way to the disk than without that step. This problem is made worse by the replacement of hardware-accelerated SHA1 computations with the software-based sha1dc computation. This write cost is significant, and the checksum capability is likely not worth that cost for such a short-lived file. The index is rewritten frequently and the only time the checksum is checked is during 'git fsck'. Thus, it would be helpful to allow a user to opt-out of the hash computation. We first need to allow Git to opt-out of the hash computation in the hashfile API. The buffered writes of the API are still helpful, so it makes sense to make the change here. Introduce a new 'skip_hash' option to 'struct hashfile'. When set, the update_fn and final_fn members of the_hash_algo are skipped. When finalizing the hashfile, the trailing hash is replaced with the null hash. This use of a trailing null hash would be desireable in either case, since we do not want to special case a file format to have a different length depending on whether it was hashed or not. When the final bytes of a file are all zero, we can infer that it was written without hashing, and thus that verification is not available as a check for file consistency. This also means that we could easily toggle hashing for any file format we desire. A version of this patch has existed in the microsoft/git fork since 2017 [1] (the linked commit was rebased in 2018, but the original dates back to January 2017). Here, the change to make the index use this fast path is delayed until a later change. [1] https://github.com/microsoft/git/commit/21fed2d91410f45d85279467f21d717a2db45201 Co-authored-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Kevin Willford <kewillf@microsoft.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-10core.fsync: introduce granular fsync control infrastructureNeeraj Singh
This commit introduces the infrastructure for the core.fsync configuration knob. The repository components we want to sync are identified by flags so that we can turn on or off syncing for specific components. If core.fsyncObjectFiles is set and the core.fsync configuration also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any loose objects. This picks the strictest data integrity behavior if core.fsync and core.fsyncObjectFiles are set to conflicting values. This change introduces the currently unused fsync_component helper, which will be used by a later patch that adds fsyncing to the refs backend. Actual configuration and documentation of the fsync components list are in other patches in the series to separate review of the underlying mechanism from the policy of how it's configured. Helped-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-25use xopen() to handle fatal open(2) failuresRené Scharfe
Add and apply a semantic patch for using xopen() instead of calling open(2) and die() or die_errno() explicitly. This makes the error messages more consistent and shortens the code. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-17Merge branch 'tb/midx-use-checksum'Junio C Hamano
When rebuilding the multi-pack index file reusing an existing one, we used to blindly trust the existing file and ended up carrying corrupted data into the updated file, which has been corrected. * tb/midx-use-checksum: midx: report checksum mismatches during 'verify' midx: don't reuse corrupt MIDXs when writing commit-graph: rewrite to use checksum_valid() csum-file: introduce checksum_valid()
2021-06-29csum-file: introduce checksum_valid()Taylor Blau
Introduce a new function which checks the validity of a file's trailing checksum. This is similar to hashfd_check(), but different since it is intended to be used by callers who aren't writing the same data (like `git index-pack --verify`), but who instead want to validate the integrity of data that they are reading. Rewrite the first of two callers which could benefit from this new function in pack-check.c. Subsequent callers will be added in the following patches. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-19csum-file.h: increase hashfile buffer sizeDerrick Stolee
The hashfile API uses a hard-coded buffer size of 8KB and has ever since it was introduced in c38138c (git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26). It performs a similar function to the hashing buffers in read-cache.c, but that code was updated from 8KB to 128KB in f279894 (read-cache: make the index write buffer size 128K, 2021-02-18). The justification there was that do_write_index() improves from 1.02s to 0.72s. Since our end goal is to have the index writing code use the hashfile API, we need to unify this buffer size to avoid a performance regression. There is a buffer, 'check_buffer', that is used to verify the check_fd file descriptor. When this buffer increases to 128K to fit the data being flushed, it causes the stack to overflow the limits placed in the test suite. To avoid issues with stack size, move both 'buffer' and 'check_buffer' to be heap pointers within 'struct hashfile'. The 'check_buffer' member is left as NULL unless check_fd is set in hashfd_check(). Both buffers are cleared as part of finalize_hashfile() which also frees the full structure. Since these buffers are now on the heap, we can adjust their size based on the needs of the consumer. In particular, callers to hashfd_throughput() are expecting to report progress indicators as the buffer flushes. These callers would prefer the smaller 8k buffer to avoid large delays between updates, especially for users with slower networks. When the progress indicator is not used, the larger buffer is preferrable. By adding a new trace2 region in the chunk-format API, we can see that the writing portion of 'git multi-pack-index write' lowers from ~1.49s to ~1.47s on a Linux machine. These effects may be more pronounced or diminished on other filesystems. The end-to-end timing is too noisy to have a definitive change either way. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-17hashfile: use write_in_full()Derrick Stolee
The flush() logic in csum-file.c was introduced originally by c38138c (git-pack-objects: write the pack files with a SHA1 csum, 2005-06-26) and a portion of the logic performs similar utility to write_in_full() in wrapper.c. The history of write_in_full() is full of moves and renames, but was originally introduced by 7230e6d (Add write_or_die(), a helper function, 2006-08-21). The point of these sections of code are to flush a write buffer using xwrite() and report errors in the case of disk space issues or other generic input/output errors. The logic in flush() can interpret the output of write_in_full() to provide the correct error messages to users. The logic in the hashfile API has an additional set of logic to augment the progress indicator between calls to xwrite(). This was introduced by 2a128d6 (add throughput display to git-push, 2007-10-30). It seems that since the hashfile's buffer is only 8KB, these additional progress indicators might not be incredibly necessary. Instead, update the progress only when write_in_full() complete. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-26csum-file: make hashwrite() more readableDerrick Stolee
The hashwrite() method takes an input buffer and updates a hashfile's hash function while writing the data to a file. To avoid overuse of flushes, the hashfile has an internal buffer and most writes will use memcpy() to transfer data from the input 'buf' to the hashfile's buffer of size 8 * 1024 bytes. Logic introduced by a8032d12 (sha1write: don't copy full sized buffers, 2008-09-02) reduces the number of memcpy() calls when the input buffer is sufficiently longer than the hashfile's buffer, causing nr to be the length of the full buffer. In these cases, the input buffer is used directly in chunks equal to the hashfile's buffer size. This method caught my attention while investigating some performance issues, but it turns out that these performance issues were noise within the variance of the experiment. However, during this investigation, I inspected hashwrite() and misunderstood it, even after looking closely and trying to make it faster. This change simply reorganizes some parts of the loop within hashwrite() to make it clear that each batch either uses memcpy() to the hashfile's buffer or writes directly from the input buffer. The previous code relied on indirection through local variables and essentially inlined the implementation of hashflush() to reduce lines of code. Helped-by: Jeff King <peff@peff.net> Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-24hash: implement and use a context cloning functionbrian m. carlson
For all of our SHA-1 implementations and most of our SHA-256 implementations, the hash context we use is a real struct. For these implementations, it's possible to copy a hash context by making a copy of the struct. However, for our libgcrypt implementation, our hash context is a pointer. Consequently, copying it does not lead to an independent hash context like we intended. Fortunately, however, libgcrypt provides us with a handy function to copy hash contexts. Let's add a cloning function to the hash algorithm API, and use it in the one place we need to make a hash context copy. With this change, our libgcrypt SHA-256 implementation is fully functional with all of our other hash implementations. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-02csum-file: refactor finalize_hashfile() methodDerrick Stolee
If we want to use a hashfile on the temporary file for a lockfile, then we need finalize_hashfile() to fully write the trailing hash but also keep the file descriptor open. Do this by adding a new CSUM_HASH_IN_STREAM flag along with a functional change that checks this flag before writing the checksum to the stream. This differs from previous behavior since it would be written if either CSUM_CLOSE or CSUM_FSYNC is provided. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-02csum-file: rename hashclose() to finalize_hashfile()Derrick Stolee
The hashclose() method behaves very differently depending on the flags parameter. In particular, the file descriptor is not always closed. Perform a simple rename of "hashclose()" to "finalize_hashfile()" in preparation for functional changes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-02-02csum-file: abstract uses of SHA-1brian m. carlson
Convert several direct uses of SHA-1 to use the_hash_algo instead. Convert one use of the constant 20 as well. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-02-02csum-file: rename sha1file to hashfilebrian m. carlson
Rename struct sha1file to struct hashfile, along with all of its related functions. The transformation in this commit was made by global search-and-replace. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-27prefer "!=" when checking read_in_full() resultJeff King
Comparing the result of read_in_full() using less-than is potentially dangerous, as a negative return value may be converted to an unsigned type and be considered a success. This is discussed further in 561598cfcf (read_pack_header: handle signed/unsigned comparison in read result, 2017-09-13). Each of these instances is actually fine in practice: - in get-tar-commit-id, the HEADERSIZE macro expands to a signed integer. If it were switched to an unsigned type (e.g., a size_t), then it would be a bug. - the other two callers check for a short read only after handling a negative return separately. This is a fine practice, but we'd prefer to model "!=" as a general rule. So all of these cases can be considered cleanups and not actual bugfixes. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-03-19sha1fd_check: die when we cannot open the fileJeff King
Right now we return a NULL "struct sha1file" if we encounter an error. However, the sole caller (write_idx_file) does not check the return value, and will segfault if we hit this case. One option would be to handle the error in the caller. However, there's really nothing for it to do but die. This code path is hit during "git index-pack --verify"; after we verify the packfile, we check that the ".idx" we would generate from it is byte-wise identical to what is on disk. We hit the error (and segfault) if we can't open the .idx file (a likely cause of this is that somebody else ran "git repack -ad" while we were verifying). Since we can't complete the requested verification, we really have no choice but to die. Furthermore, the rest of the sha1fd_* functions simply die on errors. So if were to open the file successfully, for example, and then hit a read error, sha1write would call die() for us. So pushing the die() down into sha1fd_check keeps the interface consistent. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-01-10Merge branch 'jk/sha1write-void'Junio C Hamano
Code clean-up. * jk/sha1write-void: do not pretend sha1write returns errors
2013-12-26do not pretend sha1write returns errorsJeff King
The sha1write function returns an int, but it will always be "0". The failure-prone parts of the function happen in the "flush" callback, which cannot pass an error back to us. So we just end up calling die() during the flush. Let's just drop the return value altogether, as it only confuses callers into thinking that it might be useful. Only one call site actually checked the return value. We can drop that check, since it just led to a die() anyway. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-10-24sha1write: make buffer const-correctJeff King
We are passed a "void *" and write it out without ever touching it; let's indicate that by using "const". Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-11-30csum-file: introduce sha1file_checkpointJunio C Hamano
It is useful to be able to rewind a check-summed file to a certain previous state after writing data into it using sha1write() API. The fast-import command does this after streaming a blob data to the packfile being generated and then noticing that the same blob has already been written, and it does this with a private code truncate_pack() that is commented as "Yes, this is a layering violation". Introduce two API functions, sha1file_checkpoint(), that allows the caller to save a state of a sha1file, and then later revert it to the saved state. Use it to reimplement truncate_pack(). Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-07-19Merge branch 'jc/index-pack'Junio C Hamano
* jc/index-pack: verify-pack: use index-pack --verify index-pack: show histogram when emulating "verify-pack -v" index-pack: start learning to emulate "verify-pack -v" index-pack: a miniscule refactor index-pack --verify: read anomalous offsets from v2 idx file write_idx_file: need_large_offset() helper function index-pack: --verify write_idx_file: introduce a struct to hold idx customization options index-pack: group the delta-base array entries also by type Conflicts: builtin/verify-pack.c cache.h sha1_file.c
2011-04-03sparse: Fix errors and silence warningsStephen Boyd
* load_file() returns a void pointer but is using 0 for the return value * builtin/receive-pack.c forgot to include builtin.h * packet_trace_prefix can be marked static * ll_merge takes a pointer for its last argument, not an int * crc32 expects a pointer as the second argument but Z_NULL is defined to be 0 (see 38f4d13 sparse fix: Using plain integer as NULL pointer, 2006-11-18 for more info) Signed-off-by: Stephen Boyd <bebarino@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-02-28index-pack: --verifyJunio C Hamano
Given an existing .pack file and the .idx file that describes it, this new mode of operation reads and re-index the packfile and makes sure the existing .idx file matches the result byte-for-byte. All the objects in the .pack file are validated during this operation as well. Unlike verify-pack, which visits each object described in the .idx file in the SHA-1 order, index-pack efficiently exploits the delta-chain to avoid rebuilding the objects that are used as the base of deltified objects over and over again while validating the objects, resulting in much quicker verification of the .pack file and its .idx file. This version however cannot verify a .pack/.idx pair with a handcrafted v2 index that uses 64-bit offset representation for offsets that would fit within 31-bit. You can create such an .idx file by giving a custom offset to --index-version option to the command. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-06-27Convert existing die(..., strerror(errno)) to die_errno()Thomas Rast
Change calls to die(..., strerror(errno)) to use the new die_errno(). In the process, also make slight style adjustments: at least state _something_ about the function that failed (instead of just printing the pathname), and put paths in single quotes. Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-10-10Merge branch 'maint'Shawn O. Pearce
* maint: rebase -i: do not fail when there is no commit to cherry-pick test-lib: fix color reset in say_color() fix pread()'s short read in index-pack Conflicts: csum-file.c
2008-10-10fix pread()'s short read in index-packNicolas Pitre
Since v1.6.0.2~13^2~ the completion of a thin pack uses sha1write() for its ability to compute a SHA1 on the written data. This also provides data buffering which, along with commit 92392b4a45, will confuse pread() whenever an appended object is 1) freed due to memory pressure because of the depth-first delta processing, and 2) needed again because it has many delta children, and 3) its data is still buffered by sha1write(). Let's fix the issue by simply forcing cached data out when such an object is written so it can be pread()'d at leisure. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2008-10-03fix openssl headers conflicting with custom SHA1 implementationsNicolas Pitre
On ARM I have the following compilation errors: CC fast-import.o In file included from cache.h:8, from builtin.h:6, from fast-import.c:142: arm/sha1.h:14: error: conflicting types for 'SHA_CTX' /usr/include/openssl/sha.h:105: error: previous declaration of 'SHA_CTX' was here arm/sha1.h:16: error: conflicting types for 'SHA1_Init' /usr/include/openssl/sha.h:115: error: previous declaration of 'SHA1_Init' was here arm/sha1.h:17: error: conflicting types for 'SHA1_Update' /usr/include/openssl/sha.h:116: error: previous declaration of 'SHA1_Update' was here arm/sha1.h:18: error: conflicting types for 'SHA1_Final' /usr/include/openssl/sha.h:117: error: previous declaration of 'SHA1_Final' was here make: *** [fast-import.o] Error 1 This is because openssl header files are always included in git-compat-util.h since commit 684ec6c63c whenever NO_OPENSSL is not set, which somehow brings in <openssl/sha1.h> clashing with the custom ARM version. Compilation of git is probably broken on PPC too for the same reason. Turns out that the only file requiring openssl/ssl.h and openssl/err.h is imap-send.c. But only moving those problematic includes there doesn't solve the issue as it also includes cache.h which brings in the conflicting local SHA1 header file. As suggested by Jeff King, the best solution is to rename our references to SHA1 functions and structure to something git specific, and define those according to the implementation used. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2008-09-03sha1write: don't copy full sized buffersNicolas Pitre
No need to memcpy() source buffer data when we might just process the data in place instead of accumulating it into a separate buffer. This is the case when a whole buffer would have been copied, summed, written out and then discarded right away. Also move the CRC32 processing within the loop so the data is more likely to remain in the L1 CPU cache between the CRC32 sum, SHA1 sum and the write call. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-08-30pack-objects: use fixup_pack_header_footer()'s validation modeNicolas Pitre
When limiting the pack size, a new header has to be written to the pack and a new SHA1 computed. Make sure that the SHA1 of what is being read back matches the SHA1 of what was written. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-05-31Make pack creation always fsync() the resultLinus Torvalds
This means that we can depend on packs always being stable on disk, simplifying a lot of the object serialization worries. And unlike loose objects, serializing pack creation IO isn't going to be a performance killer. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-05remove dead code from the csum-file interfaceNicolas Pitre
The provided name argument is always constant and valid in every caller's context, so no need to have an array of PATH_MAX chars to copy it into when a simple pointer will do. Unfortunately that means getting rid of wascally wabbits too. The 'error' field is also unused. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-05make display of total transferred more accurateNicolas Pitre
The throughput display needs a delay period before accounting and displaying anything. Yet it might be called after some amount of data has already been transferred. The display of total data is therefore accounted late and therefore smaller than the reality. Let's call display_throughput() with an absolute amount of transferred data instead of a relative number, and let the throughput code find the relative amount of data by itself as needed. This way the displayed total is always exact. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-30add throughput display to git-pushNicolas Pitre
This one triggers only when git-pack-objects is called with --all-progress and --stdout which is the combination used by git-push. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-10-17pack-objects.c: fix some global variable abuse and memory leaksNicolas Pitre
To keep things well layered, sha1close() now returns the file descriptor when it doesn't close the file. An ugly cast was added to the return of write_idx_file() to avoid a warning. A proper fix will come separately. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
2007-06-13More staticJunio C Hamano
There still are quite a few symbols that ought to be static. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-05-21Alter sha1close() 3rd argument to request flush onlyDana L. How
update=0 suppressed writing the final SHA-1 but was not used. Now final=0 suppresses SHA-1 finalization, SHA-1 writing, and closing -- in other words, only flush the buffer. Signed-off-by: Dana L. How <danahow@gmail.com> Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-10Custom compression levels for objects and packsDana How
Add config variables pack.compression and core.loosecompression , and switch --compression=level to pack-objects. Loose objects will be compressed using core.loosecompression if set, else core.compression if set, else Z_BEST_SPEED. Packed objects will be compressed using --compression=level if seen, else pack.compression if set, else core.compression if set, else Z_DEFAULT_COMPRESSION. This is the "pack compression level". Loose objects added to a pack undeltified will be recompressed to the pack compression level if it is unequal to the current loose compression level by the preceding rules, or if the loose object was written while core.legacyheaders = true. Newly deltified loose objects are always compressed to the current pack compression level. Previously packed objects added to a pack are recompressed to the current pack compression level exactly when their deltification status changes, since the previous pack data cannot be reused. In either case, the --no-reuse-object switch from the first patch below will always force recompression to the current pack compression level, instead of assuming the pack compression level hasn't changed and pack data can be reused when possible. This applies on top of the following patches from Nicolas Pitre: [PATCH] allow for undeltified objects not to be reused [PATCH] make "repack -f" imply "pack-objects --no-reuse-object" Signed-off-by: Dana L. How <danahow@gmail.com> Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-10compute a CRC32 for each object as stored in a packNicolas Pitre
The most important optimization for performance when repacking is the ability to reuse data from a previous pack as is and bypass any delta or even SHA1 computation by simply copying the raw data from one pack to another directly. The problem with this is that any data corruption within a copied object would go unnoticed and the new (repacked) pack would be self-consistent with its own checksum despite containing a corrupted object. This is a real issue that already happened at least once in the past. In some attempt to prevent this, we validate the copied data by inflating it and making sure no error is signaled by zlib. But this is still not perfect as a significant portion of a pack content is made of object headers and references to delta base objects which are not deflated and therefore not validated when repacking actually making the pack data reuse still not as safe as it could be. Of course a full SHA1 validation could be performed, but that implies full data inflating and delta replaying which is extremely costly, which cost the data reuse optimization was designed to avoid in the first place. So the best solution to this is simply to store a CRC32 of the raw pack data for each object in the pack index. This way any object in a pack can be validated before being copied as is in another pack, including header and any other non deflated data. Why CRC32 instead of a faster checksum like Adler32? Quoting Wikipedia: Jonathan Stone discovered in 2001 that Adler-32 has a weakness for very short messages. He wrote "Briefly, the problem is that, for very short packets, Adler32 is guaranteed to give poor coverage of the available bits. Don't take my word for it, ask Mark Adler. :-)" The problem is that sum A does not wrap for short messages. The maximum value of A for a 128-byte message is 32640, which is below the value 65521 used by the modulo operation. An extended explanation can be found in RFC 3309, which mandates the use of CRC32 instead of Adler-32 for SCTP, the Stream Control Transmission Protocol. In the context of a GIT pack, we have lots of small objects, especially deltas, which are likely to be quite small and in a size range for which Adler32 is dimed not to be sufficient. Another advantage of CRC32 is the possibility for recovery from certain types of small corruptions like single bit errors which are the most probable type of corruptions. OK what this patch does is to compute the CRC32 of each object written to a pack within pack-objects. It is not written to the index yet and it is obviously not validated when reusing pack data yet either. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-08-23Convert memcpy(a,b,20) to hashcpy(a,b).Shawn Pearce
This abstracts away the size of the hash values when copying them from memory location to memory location, much as the introduction of hashcmp abstracted away hash value comparsion. A few call sites were using char* rather than unsigned char* so I added the cast rather than open hashcpy to be void*. This is a reasonable tradeoff as most call sites already use unsigned char* and the existing hashcmp is also declared to be unsigned char*. [jc: Splitted the patch to "master" part, to be followed by a patch for merge-recursive.c which is not in "master" yet. Fixed the cast in the latter hunk to combine-diff.c which was wrong in the original. Also converted ones left-over in combine-diff.c, diff-lib.c and upload-pack.c ] Signed-off-by: Shawn O. Pearce <spearce@spearce.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-08-15Make sha1flush void and remove conditional return.David Rientjes
Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-07-03Make zlib compression level configurable, and change default.Joachim B Haga
With the change in default, "git add ." on kernel dir is about twice as fast as before, with only minimal (0.5%) change in object size. The speed difference is even more noticeable when committing large files, which is now up to 8 times faster. The configurability is through setting core.compression = [-1..9] which maps to the zlib constants; -1 is the default, 0 is no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest. Signed-off-by: Joachim B Haga (cjhaga@fys.uio.no) Acked-by: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-06-20Remove all void-pointer arithmetic.Florian Forster
ANSI C99 doesn't allow void-pointer arithmetic. This patch fixes this in various ways. Usually the strategy that required the least changes was used. Signed-off-by: Florian Forster <octo@verplant.org> Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-12-20xread/xwrite: do not worry about EINTR at calling sites.Junio C Hamano
We had errno==EINTR check after read(2)/write(2) sprinkled all over the places, always doing continue. Consolidate them into xread()/xwrite() wrapper routines. Credits for suggestion goes to HPA -- bugs are mine. Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-08-09[PATCH] Plug memory leak in sha1close()Sergey Vlasov
sha1create() and sha1fd() malloc the returned struct sha1file; sha1close() should free it. Signed-off-by: Sergey Vlasov <vsu@altlinux.ru> Signed-off-by: Junio C Hamano <junkio@cox.net>
2005-07-06[PATCH] Let umask do its work upon filesystem object creation.Junio C Hamano
IIRC our strategy was to let the users' umask take care of the final mode bits. This patch fixes places that deviate from it. Signed-off-by: Junio C Hamano <junkio@cox.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org>