summaryrefslogtreecommitdiff
path: root/builtin/gc.c
AgeCommit message (Collapse)Author
2018-08-02Merge branch 'kg/gc-auto-windows-workaround'Junio C Hamano
"git gc --auto" opens file descriptors for the packfiles before spawning "git repack/prune", which would upset Windows that does not want a process to work on a file that is open by another process. The issue has been worked around. * kg/gc-auto-windows-workaround: gc --auto: release pack files before auto packing
2018-07-09gc --auto: release pack files before auto packingKim Gybels
Teach gc --auto to release pack files before auto packing the repository to prevent failures when removing them. Also teach the test 'fetching with auto-gc does not lock up' to complain when it is no longer triggering an auto packing of the repository. Fixes https://github.com/git-for-windows/git/issues/500 Signed-off-by: Kim Gybels <kgybels@infogroep.be> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-06-27gc: automatically write commit-graph filesDerrick Stolee
The commit-graph file is a very helpful feature for speeding up git operations. In order to make it more useful, make it possible to write the commit-graph file during standard garbage collection operations. Add a 'gc.commitGraph' config setting that triggers writing a commit-graph file after any non-trivial 'git gc' command. Defaults to false while the commit-graph feature matures. We specifically do not want to have this on by default until the commit-graph feature is fully integrated with history-modifying features like shallow clones. Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-30Merge branch 'ma/lockfile-cleanup'Junio C Hamano
Code clean-up to adjust to a more recent lockfile API convention that allows lockfile instances kept on the stack. * ma/lockfile-cleanup: lock_file: move static locks into functions lock_file: make function-local locks non-static refs.c: do not die if locking fails in `delete_pseudoref()` refs.c: do not die if locking fails in `write_pseudoref()` t/helper/test-write-cache: clean up lock-handling
2018-05-23Merge branch 'nd/repack-keep-pack'Junio C Hamano
"git gc" in a large repository takes a lot of time as it considers to repack all objects into one pack by default. The command has been taught to pretend as if the largest existing packfile is marked with ".keep" so that it is left untouched while objects in other packs and loose ones are repacked. * nd/repack-keep-pack: pack-objects: show some progress when counting kept objects gc --auto: exclude base pack if not enough mem to "repack -ad" gc: handle a corner case in gc.bigPackThreshold gc: add gc.bigPackThreshold config gc: add --keep-largest-pack option repack: add --keep-pack option t7700: have closing quote of a test at the beginning of line
2018-05-10lock_file: make function-local locks non-staticMartin Ågren
Placing `struct lock_file`s on the stack used to be a bad idea, because the temp- and lockfile-machinery would keep a pointer into the struct. But after 076aa2cbd (tempfile: auto-allocate tempfiles on heap, 2017-09-05), we can safely have lockfiles on the stack. (This applies even if a user returns early, leaving a locked lock behind.) These `struct lock_file`s are local to their respective functions and we can drop their staticness. For good measure, I have inspected these sites and come to believe that they always release the lock, with the possible exception of bailing out using `die()` or `exit()` or by returning from a `cmd_foo()`. As pointed out by Jeff King, it would be bad if someone held on to a `struct lock_file *` for some reason. After some grepping, I agree with his findings: no-one appears to be doing that. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-08Merge branch 'jc/parseopt-expiry-errors'Junio C Hamano
"git gc --prune=nonsense" spent long time repacking and then silently failed when underlying "git prune --expire=nonsense" failed to parse its command line. This has been corrected. * jc/parseopt-expiry-errors: parseopt: handle malformed --expire arguments more nicely gc: do not upcase error message shown with die()
2018-04-23parseopt: handle malformed --expire arguments more nicelyJunio C Hamano
A few commands that parse --expire=<time> command line option behave sillily when given nonsense input. For example $ git prune --no-expire Segmentation falut $ git prune --expire=npw; echo $? 129 Both come from parse_opt_expiry_date_cb(). The former is because the function is not prepared to see arg==NULL (for "--no-expire", it is a norm; "--expire" at the end of the command line could be made to pass NULL, if it is told that the argument is optional, but we don't so we do not have to worry about that case). The latter is because it does not check the value returned from the underlying parse_expiry_date(). This seems to be a recent regression introduced while we attempted to avoid spewing the entire usage message when given a correct option but with an invalid value at 3bb0923f ("parse-options: do not show usage upon invalid option value", 2018-03-22). Before that, we didn't fail silently but showed a full usage help (which arguably is not all that better). Also catch this error early when "git gc --prune=<expiration>" is misspelled by doing a dummy parsing before the main body of "gc" that is time consuming even begins. Otherwise, we'd spend time to pack objects and then later have "git prune" first notice the error. Aborting "gc" in the middle that way is not harmful but is ugly and can be avoided. Helped-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-23gc: do not upcase error message shown with die()Junio C Hamano
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16gc --auto: exclude base pack if not enough mem to "repack -ad"Nguyễn Thái Ngọc Duy
pack-objects could be a big memory hog especially on large repos, everybody knows that. The suggestion to stick a .keep file on the giant base pack to avoid this problem is also known for a long time. Recent patches add an option to do just this, but it has to be either configured or activated manually. This patch lets `git gc --auto` activate this mode automatically when it thinks `repack -ad` will use a lot of memory and start affecting the system due to swapping or flushing OS cache. gc --auto decides to do this based on an estimation of pack-objects memory usage, which is quite accurate at least for the heap part, and whether that fits in half of system memory (the assumption here is for desktop environment where there are many other applications running). This mechanism only kicks in if gc.bigBasePackThreshold is not configured. If it is, it is assumed that the user already knows what they want. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16gc: handle a corner case in gc.bigPackThresholdNguyễn Thái Ngọc Duy
This config allows us to keep <N> packs back if their size is larger than a limit. But if this N >= gc.autoPackLimit, we may have a problem. We are supposed to reduce the number of packs after a threshold because it affects performance. We could tell the user that they have incompatible gc.bigPackThreshold and gc.autoPackLimit, but it's kinda hard when 'git gc --auto' runs in background. Instead let's fall back to the next best stategy: try to reduce the number of packs anyway, but keep the base pack out. This reduces the number of packs to two and hopefully won't take up too much resources to repack (the assumption still is the base pack takes most resources to handle). Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16gc: add gc.bigPackThreshold configNguyễn Thái Ngọc Duy
The --keep-largest-pack option is not very convenient to use because you need to tell gc to do this explicitly (and probably on just a few large repos). Add a config key that enables this mode when packs larger than a limit are found. Note that there's a slight behavior difference compared to --keep-largest-pack: all packs larger than the threshold are kept, not just the largest one. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16gc: add --keep-largest-pack optionNguyễn Thái Ngọc Duy
This adds a new repack mode that combines everything into a secondary pack, leaving the largest pack alone. This could help reduce memory pressure. On linux-2.6.git, valgrind massif reports 1.6GB heap in "pack all" case, and 535MB in "pack all except the base pack" case. We save roughly 1GB memory by excluding the base pack. This should also lower I/O because we don't have to rewrite a giant pack every time (e.g. for linux-2.6.git that's a 1.4GB pack file).. PS. The use of string_list here seems overkill, but we'll need it in the next patch... Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11Merge branch 'sb/packfiles-in-repository'Junio C Hamano
Refactoring of the internal global data structure continues. * sb/packfiles-in-repository: packfile: keep prepare_packed_git() private packfile: allow find_pack_entry to handle arbitrary repositories packfile: add repository argument to find_pack_entry packfile: allow reprepare_packed_git to handle arbitrary repositories packfile: allow prepare_packed_git to handle arbitrary repositories packfile: allow prepare_packed_git_one to handle arbitrary repositories packfile: add repository argument to reprepare_packed_git packfile: add repository argument to prepare_packed_git packfile: add repository argument to prepare_packed_git_one packfile: allow install_packed_git to handle arbitrary repositories packfile: allow rearrange_packed_git to handle arbitrary repositories packfile: allow prepare_packed_git_mru to handle arbitrary repositories
2018-04-11Merge branch 'sb/object-store'Junio C Hamano
Refactoring the internal global data structure to make it possible to open multiple repositories, work with and then close them. Rerolled by Duy on top of a separate preliminary clean-up topic. The resulting structure of the topics looked very sensible. * sb/object-store: (27 commits) sha1_file: allow sha1_loose_object_info to handle arbitrary repositories sha1_file: allow map_sha1_file to handle arbitrary repositories sha1_file: allow map_sha1_file_1 to handle arbitrary repositories sha1_file: allow open_sha1_file to handle arbitrary repositories sha1_file: allow stat_sha1_file to handle arbitrary repositories sha1_file: allow sha1_file_name to handle arbitrary repositories sha1_file: add repository argument to sha1_loose_object_info sha1_file: add repository argument to map_sha1_file sha1_file: add repository argument to map_sha1_file_1 sha1_file: add repository argument to open_sha1_file sha1_file: add repository argument to stat_sha1_file sha1_file: add repository argument to sha1_file_name sha1_file: allow prepare_alt_odb to handle arbitrary repositories sha1_file: allow link_alt_odb_entries to handle arbitrary repositories sha1_file: add repository argument to prepare_alt_odb sha1_file: add repository argument to link_alt_odb_entries sha1_file: add repository argument to read_info_alternates sha1_file: add repository argument to link_alt_odb_entry sha1_file: add raw_object_store argument to alt_odb_usable pack: move approximate object count to object store ...
2018-03-26packfile: keep prepare_packed_git() privateNguyễn Thái Ngọc Duy
The reason callers have to call this is to make sure either packed_git or packed_git_mru pointers are initialized since we don't do that by default. Sometimes it's hard to see this connection between where the function is called and where packed_git pointer is used (sometimes in separate functions). Keep this dependency internal because now all access to packed_git and packed_git_mru must go through get_xxx() wrappers. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-26packfile: add repository argument to reprepare_packed_gitStefan Beller
See previous patch for explanation. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-26packfile: add repository argument to prepare_packed_gitStefan Beller
Add a repository argument to allow prepare_packed_git callers to be more specific about which repository to handle. See commit "sha1_file: add repository argument to link_alt_odb_entry" for an explanation of the #define trick. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-26object-store: move packed_git and packed_git_mru to object storeStefan Beller
In a process with multiple repositories open, packfile accessors should be associated to a single repository and not shared globally. Move packed_git and packed_git_mru into the_repository and adjust callers to reflect this. [nd: while at there, wrap access to these two fields in get_packed_git() and get_packed_git_mru(). This allows us to lazily initialize these fields without caller doing that explicitly] Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-14Merge branch 'nd/parseopt-completion'Junio C Hamano
Teach parse-options API an option to help the completion script, and make use of the mechanism in command line completion. * nd/parseopt-completion: (45 commits) completion: more subcommands in _git_notes() completion: complete --{reuse,reedit}-message= for all notes subcmds completion: simplify _git_notes completion: don't set PARSE_OPT_NOCOMPLETE on --rerere-autoupdate completion: use __gitcomp_builtin in _git_worktree completion: use __gitcomp_builtin in _git_tag completion: use __gitcomp_builtin in _git_status completion: use __gitcomp_builtin in _git_show_branch completion: use __gitcomp_builtin in _git_rm completion: use __gitcomp_builtin in _git_revert completion: use __gitcomp_builtin in _git_reset completion: use __gitcomp_builtin in _git_replace remote: force completing --mirror= instead of --mirror completion: use __gitcomp_builtin in _git_remote completion: use __gitcomp_builtin in _git_push completion: use __gitcomp_builtin in _git_pull completion: use __gitcomp_builtin in _git_notes completion: use __gitcomp_builtin in _git_name_rev completion: use __gitcomp_builtin in _git_mv completion: use __gitcomp_builtin in _git_merge_base ...
2018-02-09completion: use __gitcomp_builtin in _git_gcNguyễn Thái Ngọc Duy
The new completable option is --quiet. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-12-08gc: do not repack promisor packfilesJonathan Tan
Teach gc to stop traversal at promisor objects, and to leave promisor packfiles alone. This has the effect of only repacking non-promisor packfiles, and preserves the distinction between promisor packfiles and non-promisor packfiles. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-25Merge branch 'aw/gc-lockfile-fscanf-fix'Junio C Hamano
"git gc" tries to avoid running two instances at the same time by reading and writing pid/host from and to a lock file; it used to use an incorrect fscanf() format when reading, which has been corrected. * aw/gc-lockfile-fscanf-fix: gc: call fscanf() with %<len>s, not %<len>c, when reading hostname
2017-09-17gc: call fscanf() with %<len>s, not %<len>c, when reading hostnameJunio C Hamano
Earlier in this codepath, we (ab)used "%<len>c" to read the hostname recorded in the lockfile into locking_host[HOST_NAME_MAX + 1] while substituting <len> with the actual value of HOST_NAME_MAX. This turns out to be incorrect, as it is an instruction to read exactly the specified number of bytes. Because we are trying to read at most that many bytes, we should be using "%<len>s" instead. Helped-by: A. Wilcox <awilfox@adelielinux.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-06tempfile: auto-allocate tempfiles on heapJeff King
The previous commit taught the tempfile code to give up ownership over tempfiles that have been renamed or deleted. That makes it possible to use a stack variable like this: struct tempfile t; create_tempfile(&t, ...); ... if (!err) rename_tempfile(&t, ...); else delete_tempfile(&t); But doing it this way has a high potential for creating memory errors. The tempfile we pass to create_tempfile() ends up on a global linked list, and it's not safe for it to go out of scope until we've called one of those two deactivation functions. Imagine that we add an early return from the function that forgets to call delete_tempfile(). With a static or heap tempfile variable, the worst case is that the tempfile hangs around until the program exits (and some functions like setup_shallow_temporary rely on this intentionally, creating a tempfile and then leaving it for later cleanup). But with a stack variable as above, this is a serious memory error: the variable goes out of scope and may be filled with garbage by the time the tempfile code looks at it. Let's see if we can make it harder to get this wrong. Since many callers need to allocate arbitrary numbers of tempfiles, we can't rely on static storage as a general solution. So we need to turn to the heap. We could just ask all callers to pass us a heap variable, but that puts the burden on them to call free() at the right time. Instead, let's have the tempfile code handle the heap allocation _and_ the deallocation (when the tempfile is deactivated and removed from the list). This changes the return value of all of the creation functions. For the cleanup functions (delete and rename), we'll add one extra bit of safety: instead of taking a tempfile pointer, we'll take a pointer-to-pointer and set it to NULL after freeing the object. This makes it safe to double-call functions like delete_tempfile(), as the second call treats the NULL input as a noop. Several callsites follow this pattern. The resulting patch does have a fair bit of noise, as each caller needs to be converted to handle: 1. Storing a pointer instead of the struct itself. 2. Passing the pointer instead of taking the struct address. 3. Handling a "struct tempfile *" return instead of a file descriptor. We could play games to make this less noisy. For example, by defining the tempfile like this: struct tempfile { struct heap_allocated_part_of_tempfile { int fd; ...etc } *actual_data; } Callers would continue to have a "struct tempfile", and it would be "active" only when the inner pointer was non-NULL. But that just makes things more awkward in the long run. There aren't that many callers, so we can simply bite the bullet and adjust all of them. And the compiler makes it easy for us to find them all. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-23pack: move {,re}prepare_packed_git and approximate_object_countJonathan Tan
Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-18Merge branch 'jk/gc-pre-detach-under-hook'Junio C Hamano
We run an early part of "git gc" that deals with refs before daemonising (and not under lock) even when running a background auto-gc, which caused multiple gc processes attempting to run the early part at the same time. This is now prevented by running the early part also under the GC lock. * jk/gc-pre-detach-under-hook: gc: run pre-detach operations under lock
2017-07-12Merge branch 'rs/use-div-round-up'Junio C Hamano
Code cleanup. * rs/use-div-round-up: use DIV_ROUND_UP
2017-07-12gc: run pre-detach operations under lockJeff King
We normally try to avoid having two auto-gc operations run at the same time, because it wastes resources. This was done long ago in 64a99eb47 (gc: reject if another gc is running, unless --force is given, 2013-08-08). When we do a detached auto-gc, we run the ref-related commands _before_ detaching, to avoid confusing lock contention. This was done by 62aad1849 (gc --auto: do not lock refs in the background, 2014-05-25). These two features do not interact well. The pre-detach operations are run before we check the gc.pid lock, meaning that on a busy repository we may run many of them concurrently. Ideally we'd take the lock before spawning any operations, and hold it for the duration of the program. This is tricky, though, with the way the pid-file interacts with the daemonize() process. Other processes will check that the pid recorded in the pid-file still exists. But detaching causes us to fork and continue running under a new pid. So if we take the lock before detaching, the pid-file will have a bogus pid in it. We'd have to go back and update it with the new pid after detaching. We'd also have to play some tricks with the tempfile subsystem to tweak the "owner" field, so that the parent process does not clean it up on exit, but the child process does. Instead, we can do something a bit simpler: take the lock only for the duration of the pre-detach work, then detach, then take it again for the post-detach work. Technically, this means that the post-detach lock could lose to another process doing pre-detach work. But in the long run this works out. That second process would then follow-up by doing post-detach work. Unless it was in turn blocked by a third process doing pre-detach work, and so on. This could in theory go on indefinitely, as the pre-detach work does not repack, and so need_to_gc() will continue to trigger. But in each round we are racing between the pre- and post-detach locks. Eventually, one of the post-detach locks will win the race and complete the full gc. So in the worst case, we may racily repeat the pre-detach work, but we would never do so simultaneously (it would happen via a sequence of serialized race-wins). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-10use DIV_ROUND_UPRené Scharfe
Convert code that divides and rounds up to use DIV_ROUND_UP to make the intent clearer and reduce the number of magic constants. Signed-off-by: Rene Scharfe <l.s.r@web.de> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-24Merge branch 'bw/config-h'Junio C Hamano
Fix configuration codepath to pay proper attention to commondir that is used in multi-worktree situation, and isolate config API into its own header file. * bw/config-h: config: don't implicitly use gitdir or commondir config: respect commondir setup: teach discover_git_directory to respect the commondir config: don't include config.h by default config: remove git_config_iter config: create config.h
2017-06-15config: don't include config.h by defaultBrandon Williams
Stop including config.h by default in cache.h. Instead only include config.h in those files which require use of the config system. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-16Merge branch 'js/larger-timestamps'Junio C Hamano
Some platforms have ulong that is smaller than time_t, and our historical use of ulong for timestamp would mean they cannot represent some timestamp that the platform allows. Invent a separate and dedicated timestamp_t (so that we can distingiuish timestamps and a vanilla ulongs, which along is already a good move), and then declare uintmax_t is the type to be used as the timestamp_t. * js/larger-timestamps: archive-tar: fix a sparse 'constant too large' warning use uintmax_t for timestamps date.c: abort if the system time cannot handle one of our timestamps timestamp_t: a new data type for timestamps PRItime: introduce a new "printf format" for timestamps parse_timestamp(): specify explicitly where we parse timestamps t0006 & t5000: skip "far in the future" test when time_t is too limited t0006 & t5000: prepare for 64-bit timestamps ref-filter: avoid using `unsigned long` for catch-all data type
2017-04-27timestamp_t: a new data type for timestampsJohannes Schindelin
Git's source code assumes that unsigned long is at least as precise as time_t. Which is incorrect, and causes a lot of problems, in particular where unsigned long is only 32-bit (notably on Windows, even in 64-bit versions). So let's just use a more appropriate data type instead. In preparation for this, we introduce the new `timestamp_t` data type. By necessity, this is a very, very large patch, as it has to replace all timestamps' data type in one go. As we will use a data type that is not necessarily identical to `time_t`, we need to be very careful to use `time_t` whenever we interact with the system functions, and `timestamp_t` everywhere else. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-04-24Merge branch 'dt/xgethostname-nul-termination'Junio C Hamano
gethostname(2) may not NUL terminate the buffer if hostname does not fit; unfortunately there is no easy way to see if our buffer was too small, but at least this will make sure we will not end up using garbage past the end of the buffer. * dt/xgethostname-nul-termination: xgethostname: handle long hostnames use HOST_NAME_MAX to size buffers for gethostname(2)
2017-04-19xgethostname: handle long hostnamesDavid Turner
If the full hostname doesn't fit in the buffer supplied to gethostname, POSIX does not specify whether the buffer will be null-terminated, so to be safe, we should do it ourselves. Introduce new function, xgethostname, which ensures that there is always a \0 at the end of the buffer. Signed-off-by: David Turner <dturner@twosigma.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-04-19use HOST_NAME_MAX to size buffers for gethostname(2)René Scharfe
POSIX limits the length of host names to HOST_NAME_MAX. Export the fallback definition from daemon.c and use this constant to make all buffers used with gethostname(2) big enough for any possible result and a terminating NUL. Inspired-by: David Turner <dturner@twosigma.com> Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: David Turner <dturner@twosigma.com> Reviewed-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-03-30gc: replace local buffer with git_pathJeff King
We probe the "17/" loose object directory for auto-gc, and use a local buffer to format the path. We can just use git_path() for this. It handles paths of any length (reducing our error handling). And because we feed the result straight to a system call, we can just use the static variant. Note that git_path also knows the string "objects/" is special, and will replace it with git_object_directory() when necessary. Another alternative would be to use sha1_file_name() for the pretend object "170000...", but that ends up being more hassle for no gain, as we have to truncate the final path component. Signed-off-by: Jeff King <peff@peff.net>
2017-03-17Merge branch 'cc/split-index-config'Junio C Hamano
The experimental "split index" feature has gained a few configuration variables to make it easier to use. * cc/split-index-config: (22 commits) Documentation/git-update-index: explain splitIndex.* Documentation/config: add splitIndex.sharedIndexExpire read-cache: use freshen_shared_index() in read_index_from() read-cache: refactor read_index_from() t1700: test shared index file expiration read-cache: unlink old sharedindex files config: add git_config_get_expiry() from gc.c read-cache: touch shared index files when used sha1_file: make check_and_freshen_file() non static Documentation/config: add splitIndex.maxPercentChange t1700: add tests for splitIndex.maxPercentChange read-cache: regenerate shared index if necessary config: add git_config_get_max_percent_split_change() Documentation/git-update-index: talk about core.splitIndex config var Documentation/config: add information for core.splitIndex t1700: add tests for core.splitIndex update-index: warn in case of split-index incoherency read-cache: add and then use tweak_split_index() split-index: add {add,remove}_split_index() functions config: add git_config_get_split_index() ...
2017-03-01config: add git_config_get_expiry() from gc.cChristian Couder
This function will be used in a following commit to get the expiration time of the shared index files from the config, and it is generic enough to be put in "config.c". Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-02-13gc: ignore old gc.log filesDavid Turner
A server can end up in a state where there are lots of unreferenced loose objects (say, because many users are doing a bunch of rebasing and pushing their rebased branches). Running "git gc --auto" in this state would cause a gc.log file to be created, preventing future auto gcs, causing pack files to pile up. Since many git operations are O(n) in the number of pack files, this would lead to poor performance. Git should never get itself into a state where it refuses to do any maintenance, just because at some point some piece of the maintenance didn't make progress. Teach Git to ignore gc.log files which are older than (by default) one day old, which can be tweaked via the gc.logExpiry configuration variable. That way, these pack files will get cleaned up, if necessary, at least once per day. And operators who find a need for more-frequent gcs can adjust gc.logExpiry to meet their needs. There is also some cleanup: a successful manual gc, or a warning-free auto gc with an old log file, will remove any old gc.log files. It might still happen that manual intervention is required (e.g. because the repo is corrupt), but at the very least it won't be because Git is too dumb to try again. Signed-off-by: David Turner <dturner@twosigma.com> Helped-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-12-29auto gc: don't write bitmaps for incremental repacksDavid Turner
When git gc --auto does an incremental repack of loose objects, we do not expect to be able to write a bitmap; it is very likely that objects in the new pack will have references to objects outside of the pack. So we shouldn't try to write a bitmap, because doing so will likely issue a warning. This warning was making its way into gc.log. When the gc.log was present, future auto gc runs would refuse to run. Patch by Jeff King. Bug report, test, and commit message by David Turner. Signed-off-by: David Turner <dturner@twosigma.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-29Merge branch 'jk/reduce-gc-aggressive-depth' into maintJunio C Hamano
"git gc --aggressive" used to limit the delta-chain length to 250, which is way too deep for gaining additional space savings and is detrimental for runtime performance. The limit has been reduced to 50. * jk/reduce-gc-aggressive-depth: gc: default aggressive depth to 50
2016-09-21Merge branch 'jk/reduce-gc-aggressive-depth'Junio C Hamano
"git gc --aggressive" used to limit the delta-chain length to 250, which is way too deep for gaining additional space savings and is detrimental for runtime performance. The limit has been reduced to 50. * jk/reduce-gc-aggressive-depth: gc: default aggressive depth to 50
2016-08-11gc: default aggressive depth to 50Jeff King
This commit message is long and has lots of background and numbers. The summary is: the current default of 250 doesn't save much space, and costs CPU. It's not a good tradeoff. Read on for details. The "--aggressive" flag to git-gc does three things: 1. use "-f" to throw out existing deltas and recompute from scratch 2. use "--window=250" to look harder for deltas 3. use "--depth=250" to make longer delta chains Items (1) and (2) are good matches for an "aggressive" repack. They ask the repack to do more computation work in the hopes of getting a better pack. You pay the costs during the repack, and other operations see only the benefit. Item (3) is not so clear. Allowing longer chains means fewer restrictions on the deltas, which means potentially finding better ones and saving some space. But it also means that operations which access the deltas have to follow longer chains, which affects their performance. So it's a tradeoff, and it's not clear that the tradeoff is even a good one. The existing "250" numbers for "--aggressive" come originally from this thread: http://public-inbox.org/git/alpine.LFD.0.9999.0712060803430.13796@woody.linux-foundation.org/ where Linus says: So when I said "--depth=250 --window=250", I chose those numbers more as an example of extremely aggressive packing, and I'm not at all sure that the end result is necessarily wonderfully usable. It's going to save disk space (and network bandwidth - the delta's will be re-used for the network protocol too!), but there are definitely downsides too, and using long delta chains may simply not be worth it in practice. There are some numbers in that thread, but they're mostly focused on the improved window size, and measure the improvement from --depth=250 and --window=250 together. E.g.: http://public-inbox.org/git/9e4733910712062006l651571f3w7f76ce64c6650dff@mail.gmail.com/ talks about the improved run-time of "git-blame", which comes from the reduced pack size. But most of that reduction is coming from --window=250, whereas most of the extra costs come from --depth=250. There's a link in that thread showing that increasing the depth beyond 50 doesn't seem to help much with the size: https://vcscompare.blogspot.com/2008/06/git-repack-parameters.html but again, no discussion of the timing impact. In an earlier thread from Ted Ts'o which discussed setting the non-aggressive default (from 10 to 50): http://public-inbox.org/git/20070509134958.GA21489%40thunk.org/ we have more numbers, with the conclusion that going past 50 does not help size much, and hurts the speed of normal operations. So from that, we might guess that 50 is actually a sweet spot, even for aggressive, if we interpret aggressive to "spend time now to make a better pack". It is not clear that "--depth=250" is actually a better pack. It may be slightly _smaller_, but it carries a run-time penalty. Here are some more recent timings I did to verify that. They show three things: - the size of the resulting pack (so disk saved to store, bandwidth saved on clones/fetches) - the cost of "rev-list --objects --all", which shows the effect of the delta chains on trees (commits typically don't delta, and the command doesn't touch the blobs at all) - the cost of "log -Sfoo", which will additionally access each blob All cases were repacked with "git repack -adf --depth=$d --window=250" (so basically, what would happen if we tweaked the "gc --aggressive" default depth). The timings are all wall-clock best-of-3. The machine itself has plenty of RAM compared to the repositories (which is probably typical of most workstations these days), so we're really measuring CPU usage, as the whole thing will be in disk cache after the first run. The core.deltaBaseCacheLimit is at its default of 96MiB. It's possible that tweaking it would have some impact on the tests, as some of them (especially "log -S" on a large repo) are likely to overflow that. But bumping that carries a run-time memory cost, so for these tests, I focused on what we could do just with the on-disk pack tradeoffs. Each test is done for four depths: 250 (the current value), 50 (the current default that tested well previously), 100 (to show something on the larger side, which previous tests showed was not a good tradeoff), and 10 (the very old default, which previous tests showed was worse than 50). Here are the numbers for linux.git: depth | size | % | rev-list | % | log -Sfoo | % -------+-------+-------+----------+--------+-----------+------- 250 | 967MB | n/a | 48.159s | n/a | 378.088 | n/a 100 | 971MB | +0.4% | 41.471s | -13.9% | 342.060 | -9.5% 50 | 979MB | +1.2% | 37.778s | -21.6% | 311.040s | -17.7% 10 | 1.1GB | +6.6% | 32.518s | -32.5% | 279.890s | -25.9% and for git.git: depth | size | % | rev-list | % | log -Sfoo | % -------+-------+-------+----------+--------+-----------+------- 250 | 48MB | n/a | 2.215s | n/a | 20.922s | n/a 100 | 49MB | +0.5% | 2.140s | -3.4% | 17.736s | -15.2% 50 | 49MB | +1.7% | 2.099s | -5.2% | 15.418s | -26.3% 10 | 53MB | +9.3% | 2.001s | -9.7% | 12.677s | -39.4% You can see that that the CPU savings for regular operations improves as we decrease the depth. The savings are less for "rev-list" on a smaller repository than they are for blob-accessing operations, or even rev-list on a larger repository. This may mean that a larger delta cache would help (though setting core.deltaBaseCacheLimit by itself doesn't). But we can also see that the space savings are not that great as the depth goes higher. Saving 5-10% between 10 and 50 is probably worth the CPU tradeoff. Saving 1% to go from 50 to 100, or another 0.5% to go from 100 to 250 is probably not. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-28Merge branch 'ew/gc-auto-pack-limit-fix' into maintJunio C Hamano
"gc.autoPackLimit" when set to 1 should not trigger a repacking when there is only one pack, but the code counted poorly and did so. * ew/gc-auto-pack-limit-fix: gc: fix off-by-one error with gc.autoPackLimit
2016-07-13Merge branch 'ew/gc-auto-pack-limit-fix'Junio C Hamano
"gc.autoPackLimit" when set to 1 should not trigger a repacking when there is only one pack, but the code counted poorly and did so. * ew/gc-auto-pack-limit-fix: gc: fix off-by-one error with gc.autoPackLimit
2016-06-27gc: fix off-by-one error with gc.autoPackLimitEric Wong
This matches the documentation and allows gc.autoPackLimit=1 to maintain a single pack without attempting a repack on every "git gc --auto" invocation. Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-11-20Merge branch 'dk/gc-idx-wo-pack'Jeff King
Having a leftover .idx file without corresponding .pack file in the repository hurts performance; "git gc" learned to prune them. * dk/gc-idx-wo-pack: gc: remove garbage .idx files from pack dir t5304: test cleaning pack garbage prepare_packed_git(): refactor garbage reporting in pack directory
2015-11-04gc: remove garbage .idx files from pack dirDoug Kelly
Add a custom report_garbage handler to collect and remove garbage .idx files from the pack directory. Signed-off-by: Doug Kelly <dougk.ff7@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>