summaryrefslogtreecommitdiff
path: root/builtin/index-pack.c
AgeCommit message (Collapse)Author
2017-03-16index-pack: make pointer-alias fallbacks saferJeff King
The final() function accepts a NULL value for certain parameters, and falls back to writing into a reusable "name" buffer, and then either: 1. For "keep_name", requiring all uses to do "keep_name ? keep_name : name.buf". This is awkward, and it's easy to accidentally look at the maybe-NULL keep_name. 2. For "final_index_name" and "final_pack_name", aliasing those pointers to the "name" buffer. This is easier to use, but the aliased pointers become invalid after the buffer is reused (this isn't a bug now, but it's a potential pitfall). One way to make this safer would be to introduce an extra pointer to do the aliasing, and have its lifetime match the validity of the "name" buffer. But it's still easy to accidentally use the wrong name (i.e., to use "final_pack_name" instead of the aliased pointer). Instead, let's use three separate buffers that will remain valid through the function. That makes it safe to alias the pointers and use them consistently. The extra allocations shouldn't matter, as this function is not performance sensitive. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-16replace snprintf with odb_pack_name()Jeff King
In several places we write the name of the pack filename into a fixed-size buffer using snprintf(), but do not check the return value. As a result, a very long object directory could cause us to quietly truncate the pack filename (potentially leading to a corrupted repository, as a newly written packfile could be missing its .pack extension). We can use odb_pack_name() to do this with a strbuf (and shorten the code, as well). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-16odb_pack_keep(): stop generating keepfile nameJeff King
The odb_pack_keep() function generates the name of a .keep file and opens it. This has two problems: 1. It requires a fixed-size buffer to create the filename and doesn't notice when the result is truncated. 2. Of the two callers, one sometimes wants to open a filename it already has, which makes things awkward (it has to do so manually, and skips the leading-directory creation). Instead, let's have odb_pack_keep() just open the file. Generating the name isn't hard, and a future patch will switch callers over to odb_pack_name() anyway. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-12-16index-pack: skip collision check when not in repositoryJeff King
You can run "git index-pack path/to/foo.pack" outside of a repository to generate an index file, or just to verify the contents. There's no point in doing a collision check, since we obviously do not have any objects to collide with. The current code will blindly look in .git/objects based on the result of setup_git_env(). That effectively gives us the right answer (since we won't find any objects), but it's a waste of time, and it conflicts with our desire to eventually get rid of the "fallback to .git" behavior of setup_git_env(). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-12-16index-pack: complain when --stdin is used outside of a repoJeff King
The index-pack builtin is marked as RUN_SETUP_GENTLY, because it's perfectly fine to index a pack in the filesystem outside of any repository. However, --stdin mode will write the result to the object database, which does not make sense outside of a repository. Doing so creates a bogus ".git" directory with nothing in it except the newly-created pack and its index. Instead, let's flag this as an error and abort. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-30use QSORT, part 2René Scharfe
Convert two more qsort(3) calls to QSORT to reduce code size and for better safety and consistency. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-29use QSORTRené Scharfe
Apply the semantic patch contrib/coccinelle/qsort.cocci to the code base, replacing calls of qsort(3) with QSORT. The resulting code is shorter and supports empty arrays with NULL pointers. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-24index-pack: add --max-input-size=<size> optionJeff King
When receiving a pack-file, it can be useful to abort the `git index-pack`, if the pack-file is too big. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-03Merge branch 'jk/push-progress'Junio C Hamano
"git push" and "git clone" learned to give better progress meters to the end user who is waiting on the terminal. * jk/push-progress: receive-pack: send keepalives during quiet periods receive-pack: turn on connectivity progress receive-pack: relay connectivity errors to sideband receive-pack: turn on index-pack resolving progress index-pack: add flag for showing delta-resolution progress clone: use a real progress meter for connectivity check check_connected: add progress flag check_connected: relay errors to alternate descriptor check_everything_connected: use a struct with named options check_everything_connected: convert to argv_array rev-list: add optional progress reporting check_everything_connected: always pass --quiet to rev-list
2016-07-20receive-pack: send keepalives during quiet periodsJeff King
After a client has sent us the complete pack, we may spend some time processing the data and running hooks. If the client asked us to be quiet, receive-pack won't send any progress data during the index-pack or connectivity-check steps. And hooks may or may not produce their own progress output. In these cases, the network connection is totally silent from both ends. Git itself doesn't care about this (it will wait forever), but other parts of the system (e.g., firewalls, load-balancers, etc) might hang up the connection. So we'd like to send some sort of keepalive to let the network and the client side know that we're still alive and processing. We can use the same trick we did in 05e9515 (upload-pack: send keepalive packets during pack computation, 2013-09-08). Namely, we will send an empty sideband data packet every `N` seconds that we do not relay any stderr data over the sideband channel. As with 05e9515, this means that we won't bother sending keepalives when there's actual progress data, but will kick in when it has been disabled (or if there is a lull in the progress data). The concept is simple, but the details are subtle enough that they need discussing here. Before the client sends us the pack, we don't want to do any keepalives. We'll have sent our ref advertisement, and we're waiting for them to send us the pack (and tell us that they support sidebands at all). While we're receiving the pack from the client (or waiting for it to start), there's no need for keepalives; it's up to them to keep the connection active by sending data. Moreover, it would be wrong for us to do so. When we are the server in the smart-http protocol, we must treat our connection as half-duplex. So any keepalives we send while receiving the pack would potentially be buffered by the webserver. Not only does this make them useless (since they would not be delivered in a timely manner), but it could actually cause a deadlock if we fill up the buffer with keepalives. (It wouldn't be wrong to send keepalives in this phase for a full-duplex connection like ssh; it's simply pointless, as it is the client's responsibility to speak). As soon as we've gotten all of the pack data, then the client is waiting for us to speak, and we should start keepalives immediately. From here until the end of the connection, we send one any time we are not otherwise sending data. But there's a catch. Receive-pack doesn't know the moment we've gotten all the data. It passes the descriptor to index-pack, who reads all of the data, and then starts resolving the deltas. We have to communicate that back. To make this work, we instruct the sideband muxer to enable keepalives in three phases: 1. In the beginning, not at all. 2. While reading from index-pack, wait for a signal indicating end-of-input, and then start them. 3. Afterwards, always. The signal from index-pack in phase 2 has to come over the stderr channel which the muxer is reading. We can't use an extra pipe because the portable run-command interface only gives us stderr and stdout. Stdout is already used to pass the .keep filename back to receive-pack. We could also send a signal there, but then we would find out about it in the main thread. And the keepalive needs to be done by the async muxer thread (since it's the one writing sideband data back to the client). And we can't reliably signal the async thread from the main thread, because the async code sometimes uses threads and sometimes uses forked processes. Therefore the signal must come over the stderr channel, where it may be interspersed with other random human-readable messages from index-pack. This patch makes the signal a single NUL byte. This is easy to parse, should not appear in any normal stderr output, and we don't have to worry about any timing issues (like seeing half the signal bytes in one read(), and half in a subsequent one). This is a bit ugly, but it's simple to code and should work reliably. Another option would be to stop using an async thread for muxing entirely, and just poll() both stderr and stdout of index-pack from the main thread. This would work for index-pack (because we aren't doing anything useful in the main thread while it runs anyway). But it would make the connectivity check and the hook muxers much more complicated, as they need to simultaneously feed the sub-programs while reading their stderr. The index-pack phase is the only one that needs this signaling, so it could simply behave differently than the other two. That would mean having two separate implementations of copy_to_sideband (and the keepalive code), though. And it still doesn't get rid of the signaling; it just means we can write a nicer message like "END_OF_INPUT" or something on stdout, since we don't have to worry about separating it from the stderr cruft. One final note: this signaling trick is only done with index-pack, not with unpack-objects. There's no point in doing it for the latter, because by definition it only kicks in for a small number of objects, where keepalives are not as useful (and this conveniently lets us avoid duplicating the implementation). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-20index-pack: add flag for showing delta-resolution progressJeff King
The index-pack command has two progress meters: one for "receiving objects", and one for "resolving deltas". You get neither by default, or both with "-v". But for a push through receive-pack, we would want only the "resolving deltas" phase, _not_ the "receiving objects" progress. There are two reasons for this. One is simply that existing clients are already printing "writing objects" progress at the same time. Arguably "receiving" from the far end is more useful, because it tells you what has actually gotten there, as opposed to what might be stuck in a buffer somewhere between the client and server. But that would require a protocol extension to tell clients not to print their progress. Possible, but complexity for little gain. The second reason is much more important. In a full-duplex connection like git-over-ssh, we can print progress while the pack is incoming, and it will immediately get to the client. But for a half-duplex connection like git-over-http, we should not say anything until we have received the full request. Anything we write is subject to being stuck in a buffer by the webserver. Worse, we can end up in a deadlock if that buffer fills up. So our best bet is to avoid writing anything that isn't a small fixed size until we've received the full pack. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-13index-pack: correct "offset" type in unpack_entry_data()Nguyễn Thái Ngọc Duy
unpack_entry_data() receives an off_t value from unpack_raw_entry(), which could be larger than unsigned long on 32-bit systems with large file support. Correct the type so truncation does not happen. This only affects bad object reporting though. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-13index-pack: report correct bad object offsets even if they are largeNguyễn Thái Ngọc Duy
Use the right type for offsets in this case, off_t, which makes a difference on 32-bit systems with large file support, and change formatting code accordingly. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-13index-pack: correct "len" type in unpack_data()Nguyễn Thái Ngọc Duy
On 32-bit systems with large file support, one entry could be larger than 4GB and overflow "len". Correct it so we can unpack a full entry. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-05-26Merge branch 'va/i18n-misc-updates' into maintJunio C Hamano
Mark several messages for translation. * va/i18n-misc-updates: i18n: unpack-trees: avoid substituting only a verb in sentences i18n: builtin/pull.c: split strings marked for translation i18n: builtin/pull.c: mark placeholders for translation i18n: git-parse-remote.sh: mark strings for translation i18n: branch: move comment for translators i18n: branch: unmark string for translation i18n: builtin/rm.c: remove a comma ',' from string i18n: unpack-trees: mark strings for translation i18n: builtin/branch.c: mark option for translation i18n: index-pack: use plural string instead of normal one
2016-05-17Merge branch 'va/i18n-misc-updates'Junio C Hamano
Mark several messages for translation. * va/i18n-misc-updates: i18n: unpack-trees: avoid substituting only a verb in sentences i18n: builtin/pull.c: split strings marked for translation i18n: builtin/pull.c: mark placeholders for translation i18n: git-parse-remote.sh: mark strings for translation i18n: branch: move comment for translators i18n: branch: unmark string for translation i18n: builtin/rm.c: remove a comma ',' from string i18n: unpack-trees: mark strings for translation i18n: builtin/branch.c: mark option for translation i18n: index-pack: use plural string instead of normal one
2016-04-15Merge branch 'jc/index-pack' into maintJunio C Hamano
Code clean-up. * jc/index-pack: index-pack: add a helper function to derive .idx/.keep filename index-pack: correct --keep[=<msg>]
2016-04-08i18n: index-pack: use plural string instead of normal oneVasco Almeida
Git could output "completed with 1 local objects", but in this case using "object" instead of "objects" is the correct form. Use Q_() instead of _(). Signed-off-by: Vasco Almeida <vascomalmeida@sapo.pt> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-04-03Merge branch 'jc/index-pack'Junio C Hamano
Code clean-up. * jc/index-pack: index-pack: add a helper function to derive .idx/.keep filename
2016-04-03Merge branch 'jc/maint-index-pack-keep'Junio C Hamano
"git index-pack --keep[=<msg>] pack-$name.pack" simply did not work. * jc/maint-index-pack-keep: index-pack: correct --keep[=<msg>]
2016-03-04Merge branch 'jk/pack-idx-corruption-safety'Junio C Hamano
The code to read the pack data using the offsets stored in the pack idx file has been made more carefully check the validity of the data in the idx. * jk/pack-idx-corruption-safety: sha1_file.c: mark strings for translation use_pack: handle signed off_t overflow nth_packed_object_offset: bounds-check extended offset t5313: test bounds-checks of corrupted/malicious pack/idx files
2016-03-03index-pack: add a helper function to derive .idx/.keep filenameJunio C Hamano
These are automatically named by replacing .pack suffix in the name of the packfile. Add a small helper to do so, as I'll be adding another one soonish. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-03-03Merge branch 'jc/maint-index-pack-keep' into jc/index-packJunio C Hamano
* jc/maint-index-pack-keep: index-pack: correct --keep[=<msg>]
2016-03-03index-pack: correct --keep[=<msg>]Junio C Hamano
When 592ce208 (index-pack: use strip_suffix to avoid magic numbers, 2014-06-30) refactored the code to derive names of .idx and .keep files from the name of .pack file, a copy-and-paste typo crept in, mistakingly attempting to create and store the keep message file in the .idx file we just created, instead of .keep file. As we create the .keep file with O_CREAT|O_EXCL, and we do so after we write the .idx file, we luckily do not clobber the .idx file, but because we deliberately ignored EEXIST when creating .keep file (which is justifiable because only the existence of .keep file matters), nobody noticed this mistake so far. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-25nth_packed_object_offset: bounds-check extended offsetJeff King
If a pack .idx file has a corrupted offset for an object, we may try to access an offset in the .idx or .pack file that is larger than the file's size. For the .pack case, we have use_pack() to protect us, which realizes the access is out of bounds. But if the corrupted value asks us to look in the .idx file's secondary 64-bit offset table, we blindly add it to the mmap'd index data and access arbitrary memory. We can fix this with a simple bounds-check compared to the size we found when we opened the .idx file. Note that there's similar code in index-pack that is triggered only during "index-pack --verify". To support both, we pull the bounds-check into a separate function, which dies when it sees a corrupted file. It would be nice if we could return an error, so that the pack code could try to find a good copy of the object elsewhere. Currently nth_packed_object_offset doesn't have any way to return an error, but it could probably use "0" as a sentinel value (since no object can start there). This is the minimal fix, and we can improve the resilience later on top. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-22use st_add and st_mult for allocation size computationJeff King
If our size computation overflows size_t, we may allocate a much smaller buffer than we expected and overflow it. It's probably impossible to trigger an overflow in most of these sites in practice, but it is easy enough convert their additions and multiplications into overflow-checking variants. This may be fixing real bugs, and it makes auditing the code easier. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-02-22convert trivial cases to ALLOC_ARRAYJeff King
Each of these cases can be converted to use ALLOC_ARRAY or REALLOC_ARRAY, which has two advantages: 1. It automatically checks the array-size multiplication for overflow. 2. It always uses sizeof(*array) for the element-size, so that it can never go out of sync with the declared type of the array. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-11-20Remove get_object_hash.brian m. carlson
Convert all instances of get_object_hash to use an appropriate reference to the hash member of the oid member of struct object. This provides no functional change, as it is essentially a macro substitution. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Jeff King <peff@peff.net>
2015-11-20Convert struct object to object_idbrian m. carlson
struct object is one of the major data structures dealing with object IDs. Convert it to use struct object_id instead of an unsigned char array. Convert get_object_hash to refer to the new member as well. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Jeff King <peff@peff.net>
2015-11-20Add several uses of get_object_hash.brian m. carlson
Convert most instances where the sha1 member of struct object is dereferenced to use get_object_hash. Most instances that are passed to functions that have versions taking struct object_id, such as get_sha1_hex/get_oid_hex, or instances that can be trivially converted to use struct object_id instead, are not converted. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Jeff King <peff@peff.net>
2015-09-25use xsnprintf for generating git object headersJeff King
We generally use 32-byte buffers to format git's "type size" header fields. These should not generally overflow unless you can produce some truly gigantic objects (and our types come from our internal array of constant strings). But it is a good idea to use xsnprintf to make sure this is the case. Note that we slightly modify the interface to write_sha1_file_prepare, which nows uses "hdrlen" as an "in" parameter as well as an "out" (on the way in it stores the allocated size of the header, and on the way out it returns the ultimate size of the header). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-08-19Merge branch 'jc/finalize-temp-file'Junio C Hamano
Long overdue micro clean-up. * jc/finalize-temp-file: sha1_file.c: rename move_temp_to_file() to finalize_object_file()
2015-08-10sha1_file.c: rename move_temp_to_file() to finalize_object_file()Junio C Hamano
Since 5a688fe4 ("core.sharedrepository = 0mode" should set, not loosen, 2009-03-25), we kept reminding ourselves: NEEDSWORK: this should be renamed to finalize_temp_file() as "moving" is only a part of what it does, when no patch between master to pu changes the call sites of this function. without doing anything about it. Let's do so. The purpose of this function was not to move but to finalize. The detail of the primarily implementation of finalizing was to link the temporary file to its final name and then to unlink, which wasn't even "moving". The alternative implementation did "move" by calling rename(2), which is a fun tangent. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-08-03Merge branch 'js/fsck-opt'Junio C Hamano
Allow ignoring fsck errors on specific set of known-to-be-bad objects, and also tweaking warning level of various kinds of non critical breakages reported. * js/fsck-opt: fsck: support ignoring objects in `git fsck` via fsck.skiplist fsck: git receive-pack: support excluding objects from fsck'ing fsck: introduce `git fsck --connectivity-only` fsck: support demoting errors to warnings fsck: document the new receive.fsck.<msg-id> options fsck: allow upgrading fsck warnings to errors fsck: optionally ignore specific fsck issues completely fsck: disallow demoting grave fsck errors to warnings fsck: add a simple test for receive.fsck.<msg-id> fsck: make fsck_tag() warn-friendly fsck: handle multiple authors in commits specially fsck: make fsck_commit() warn-friendly fsck: make fsck_ident() warn-friendly fsck: report the ID of the error/warning fsck (receive-pack): allow demoting errors to warnings fsck: offer a function to demote fsck errors to warnings fsck: provide a function to parse fsck message IDs fsck: introduce identifiers for fsck messages fsck: introduce fsck options
2015-07-27Merge branch 'jk/index-pack-reduce-recheck' into maintJunio C Hamano
Disable "have we lost a race with competing repack?" check while receiving a huge object transfer that runs index-pack. * jk/index-pack-reduce-recheck: index-pack: avoid excessive re-reading of pack directory
2015-07-09Merge branch 'jc/fix-alloc-sortbuf-in-index-pack'Junio C Hamano
A hotfix for what is in 2.5-rc but not in 2.4. * jc/fix-alloc-sortbuf-in-index-pack: index-pack: fix allocation of sorted_by_pos array
2015-07-04index-pack: fix allocation of sorted_by_pos arrayJunio C Hamano
When c6458e60 (index-pack: kill union delta_base to save memory, 2015-04-18) attempted to reduce the memory footprint of index-pack, one of the key thing it did was to keep track of ref-deltas and ofs-deltas separately. In fix_unresolved_deltas(), however it forgot that it now wants to look only at ref deltas in one place. The code allocated an array for nr_unresolved, which is sum of number of ref- and ofs-deltas minus nr_resolved, which may be larger or smaller than the number ref-deltas. Depending on nr_resolved, this was either under or over allocating. Also, the old code before this change had to use 'i' and 'n' because some of the things we see in the (old) deltas[] array we scanned with 'i' would not make it into the sorted_by_pos[] array in the old world order, but now because you have only ref delta in a separate ref_deltas[] array, they increment lock&step. We no longer need separate variables. And most importantly, we shouldn't pass the nr_unresolved parameter, as this number does not play a role in the working of this helper function. Helped-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-06-24Merge branch 'jk/index-pack-reduce-recheck'Junio C Hamano
Disable "have we lost a race with competing repack?" check while receiving a huge object transfer that runs index-pack. * jk/index-pack-reduce-recheck: index-pack: avoid excessive re-reading of pack directory
2015-06-23fsck (receive-pack): allow demoting errors to warningsJohannes Schindelin
For example, missing emails in commit and tag objects can be demoted to mere warnings with git config receive.fsck.missingemail=warn The value is actually a comma-separated list. In case that the same key is listed in multiple receive.fsck.<msg-id> lines in the config, the latter configuration wins (this can happen for example when both $HOME/.gitconfig and .git/config contain message type settings). As git receive-pack does not actually perform the checks, it hands off the setting to index-pack or unpack-objects in the form of an optional argument to the --strict option. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-06-22fsck: introduce fsck optionsJohannes Schindelin
Just like the diff machinery, we are about to introduce more settings, therefore it makes sense to carry them around as a (pointer to a) struct containing all of them. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-06-16Merge branch 'nd/slim-index-pack-memory-usage'Junio C Hamano
An earlier optimization broke index-pack for a large object transfer; this fixes it before the breakage hits any released version. * nd/slim-index-pack-memory-usage: index-pack: fix truncation of off_t in comparison
2015-06-09index-pack: avoid excessive re-reading of pack directoryJeff King
Since 45e8a74 (has_sha1_file: re-check pack directory before giving up, 2013-08-30), we spend extra effort for has_sha1_file to give the right answer when somebody else is repacking. Usually this effort does not matter, because after finding that the object does not exist, the next step is usually to die(). However, some code paths make a large number of has_sha1_file checks which are _not_ expected to return 1. The collision test in index-pack.c is such a case. On a local system, this can cause a performance slowdown of around 5%. But on a system with high-latency system calls (like NFS), it can be much worse. This patch introduces a "quick" flag to has_sha1_file which callers can use when they would prefer high performance at the cost of false negatives during repacks. There may be other code paths that can use this, but the index-pack one is the most obviously critical, so we'll start with switching that one. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-06-04index-pack: fix truncation of off_t in comparisonJeff King
Commit c6458e6 (index-pack: kill union delta_base to save memory, 2015-04-18) refactored the comparison functions used in sorting and binary searching our delta list. The resulting code does something like: int cmp_offsets(off_t a, off_t b) { return a - b; } This works most of the time, but produces nonsensical results when the difference between the two offsets is larger than what can be stored in an "int". This can lead to unresolved deltas if the packsize is larger than 2G (even on 64-bit systems, an int is still typically 32 bits): $ git clone git://github.com/mozilla/gecko-dev Cloning into 'gecko-dev'... remote: Counting objects: 4800161, done. remote: Compressing objects: 100% (178/178), done. remote: Total 4800161 (delta 88), reused 0 (delta 0), pack-reused 4799978 Receiving objects: 100% (4800161/4800161), 2.21 GiB | 3.26 MiB/s, done. Resolving deltas: 99% (3808820/3811944), completed with 0 local objects. fatal: pack has 3124 unresolved deltas fatal: index-pack failed We can fix it by doing direct comparisons between the offsets and returning constants; the callers only care about the sign of the comparison, not the magnitude. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-05-11Merge branch 'nd/slim-index-pack-memory-usage'Junio C Hamano
Memory usage of "git index-pack" has been trimmed by tens of per-cent. * nd/slim-index-pack-memory-usage: index-pack: kill union delta_base to save memory index-pack: reduce object_entry size to save memory
2015-04-19index-pack: kill union delta_base to save memoryNguyễn Thái Ngọc Duy
Once we know the number of objects in the input pack, we allocate an array of nr_objects of struct delta_entry. On x86-64, this struct is 32 bytes long. The union delta_base, which is part of struct delta_entry, provides enough space to store either ofs-delta (8 bytes) or ref-delta (20 bytes). Because ofs-delta encoding is more efficient space-wise and more performant at runtime than ref-delta encoding, Git packers try to use ofs-delta whenever possible, and it is expected that objects encoded as ref-delta are minority. In the best clone case where no ref-delta object is present, we waste (20-8) * nr_objects bytes because of this union. That's about 38MB out of 100MB for deltas[] with 3.4M objects, or 38%. deltas[] would be around 62MB without the waste. This patch attempts to eliminate that. deltas[] array is split into two: one for ofs-delta and one for ref-delta. Many functions are also duplicated because of this split. With this patch, ofs_deltas[] array takes 51MB. ref_deltas[] should remain unallocated in clone case (0 bytes). This array grows as we see ref-delta. We save about half in this case, or 25% of total bookkeeping. The saving is more than the calculation above because some padding in the old delta_entry struct is removed. ofs_delta_entry is 16 bytes, including the 4 bytes padding. That's 13MB for padding, but packing the struct could break platforms that do not support unaligned access. If someone on 32-bit is really low on memory and only deals with packs smaller than 2G, using 32-bit off_t would eliminate the padding and save 27MB on top. A note about ofs_deltas allocation. We could use ref_deltas memory allocation strategy for ofs_deltas. But that probably just adds more overhead on top. ofs-deltas are generally the majority (1/2 to 2/3) in any pack. Incremental realloc may lead to too many memcpy. And if we preallocate, say 1/2 or 2/3 of nr_objects initially, the growth rate of ALLOC_GROW() could make this array larger than nr_objects, wasting more memory. Brought-up-by: Matthew Sporleder <msporleder@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-03-17Merge branch 'rs/deflate-init-cleanup'Junio C Hamano
Code simplification. * rs/deflate-init-cleanup: zlib: initialize git_zstream in git_deflate_init{,_gzip,_raw}
2015-03-05zlib: initialize git_zstream in git_deflate_init{,_gzip,_raw}René Scharfe
Clear the git_zstream variable at the start of git_deflate_init() etc. so that callers don't have to do that. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-02-27index-pack: reduce object_entry size to save memoryNguyễn Thái Ngọc Duy
For each object in the input pack, we need one struct object_entry. On x86-64, this struct is 64 bytes long. Although: - The 8 bytes for delta_depth and base_object_no are only useful when show_stat is set. And it's never set unless someone is debugging. - The three fields hdr_size, type and real_type take 4 bytes each even though they never use more than 4 bits. By moving delta_depth and base_object_no out of struct object_entry and make the other 3 fields one byte long instead of 4, we shrink 25% of this struct. On a 3.4M object repo (*) that's about 53MB. The saving is less impressive compared to index-pack memory use for basic bookkeeping (**), about 16%. (*) linux-2.6.git already has 4M objects as of v3.19-rc7 so this is not an unrealistic number of objects that we have to deal with. (**) 3.4M * (sizeof(object_entry) + sizeof(delta_entry)) = 311MB Brought-up-by: Matthew Sporleder <msporleder@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-12-22Merge branch 'js/fsck-tag-validation'Junio C Hamano
New tag object format validation added in 2.2 showed garbage after a tagname it reported in its error message. * js/fsck-tag-validation: index-pack: terminate object buffers with NUL fsck: properly bound "invalid tag name" error message
2014-12-09index-pack: terminate object buffers with NULDuy Nguyen
We have some tricky checks in fsck that rely on a side effect of require_end_of_header(), and would otherwise easily run outside non-NUL-terminated buffers. This is a bit brittle, so let's make sure that only NUL-terminated buffers are passed around to begin with. Jeff "Peff" King contributed the detailed analysis which call paths are involved and pointed out that we also have to patch the get_data() function in unpack-objects.c, which is what Johannes "Dscho" Schindelin implemented. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Analyzed-by: Jeff King <peff@peff.net> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>