summaryrefslogtreecommitdiff
path: root/sha1_file.c
AgeCommit message (Collapse)Author
2017-02-22sha1_file: introduce an nth_packed_object_oid functionbrian m. carlson
There are places in the code where we would like to provide a struct object_id *, yet read the hash directly from the pack. Provide an nth_packed_object_oid function that is similar to the nth_packed_object_sha1 function. In order to avoid a potentially invalid cast, nth_packed_object_oid provides a variable into which to store the value, which it returns on success; on error, it returns NULL, as nth_packed_object_sha1 does. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-31Merge branch 'jk/clear-delta-base-cache-fix'Junio C Hamano
A crashing bug introduced in v2.11 timeframe has been found (it is triggerable only in fast-import) and fixed. * jk/clear-delta-base-cache-fix: clear_delta_base_cache(): don't modify hashmap while iterating
2017-01-31Merge branch 'jk/loose-object-fsck'Junio C Hamano
"git fsck" inspects loose objects more carefully now. * jk/loose-object-fsck: fsck: detect trailing garbage in all object types fsck: parse loose object paths directly sha1_file: add read_loose_object() function t1450: test fsck of packed objects sha1_file: fix error message for alternate objects t1450: refactor loose-object removal
2017-01-19clear_delta_base_cache(): don't modify hashmap while iteratingJeff King
On Thu, Jan 19, 2017 at 03:03:46PM +0100, Ulrich Spörlein wrote: > > I suspect the patch below may fix things for you. It works around it by > > walking over the lru list (either is fine, as they both contain all > > entries, and since we're clearing everything, we don't care about the > > order). > > Confirmed. With the patch applied, I can import the whole 55G in one go > without any crashes or aborts. Thanks much! Thanks. Here it is rolled up with a commit message. -- >8 -- Subject: clear_delta_base_cache(): don't modify hashmap while iterating Removing entries while iterating causes fast-import to access an already-freed `struct packed_git`, leading to various confusing errors. What happens is that clear_delta_base_cache() drops the whole contents of the cache by iterating over the hashmap, calling release_delta_base_cache() on each entry. That function removes the item from the hashmap. The hashmap code may then shrink the table, but the hashmap_iter struct retains an offset from the old table. As a result, the next call to hashmap_iter_next() may claim that the iteration is done, even though some items haven't been visited. The only caller of clear_delta_base_cache() is fast-import, which wants to clear the cache because it is discarding the packed_git struct for its temporary pack. So by failing to remove all of the entries, we still have references to the freed packed_git. To make things even more confusing, this doesn't seem to trigger with the test suite, because it depends on complexities like the size of the hash table, which entries got cleared, whether we try to access them before they're evicted from the cache, etc. So I've been able to identify the problem with large imports like freebsd's svn import, or a fast-export of linux.git. But nothing that would be reasonable to run as part of the normal test suite. We can fix this easily by iterating over the lru linked list instead of the hashmap. They both contain the same entries, and we can use the "safe" variant of the list iterator, which exists for exactly this case. Let's also add a warning to the hashmap API documentation to reduce the chances of getting bit by this again. Reported-by: Ulrich Spörlein <uqs@freebsd.org> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-18Merge branch 'bw/grep-recurse-submodules'Junio C Hamano
"git grep" has been taught to optionally recurse into submodules. * bw/grep-recurse-submodules: grep: search history of moved submodules grep: enable recurse-submodules to work on <tree> objects grep: optionally recurse into submodules grep: add submodules as a grep source type submodules: load gitmodules file from commit sha1 submodules: add helper to determine if a submodule is initialized submodules: add helper to determine if a submodule is populated real_path: canonicalize directory separators in root parts real_path: have callers use real_pathdup and strbuf_realpath real_path: create real_pathdup real_path: convert real_path_internal to strbuf_realpath real_path: resolve symlinks by hand
2017-01-17Merge branch 'jk/quote-env-path-list-component' into maintJunio C Hamano
A recent update to receive-pack to make it easier to drop garbage objects made it clear that GIT_ALTERNATE_OBJECT_DIRECTORIES cannot have a pathname with a colon in it (no surprise!), and this in turn made it impossible to push into a repository at such a path. This has been fixed by introducing a quoting mechanism used when appending such a path to the colon-separated list. * jk/quote-env-path-list-component: t5615-alternate-env: double-quotes in file names do not work on Windows t5547-push-quarantine: run the path separator test on Windows, too tmp-objdir: quote paths we add to alternates alternates: accept double-quoted paths
2017-01-15fsck: detect trailing garbage in all object typesJeff King
When a loose tree or commit is read by fsck (or any git program), unpack_sha1_rest() checks whether there is extra cruft at the end of the object file, after the zlib data. Blobs that are streamed, however, do not have this check. For normal git operations, it's not a big deal. We know the sha1 and size checked out, so we have the object bytes we wanted. The trailing garbage doesn't affect what we're trying to do. But since the point of fsck is to find corruption or other problems, it should be more thorough. This patch teaches its loose-sha1 reader to detect extra bytes after the zlib stream and complain. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-15sha1_file: add read_loose_object() functionJeff King
It's surprisingly hard to ask the sha1_file code to open a _specific_ incarnation of a loose object. Most of the functions take a sha1, and loop over the various object types (packed versus loose) and locations (local versus alternates) at a low level. However, some tools like fsck need to look at a specific file. This patch gives them a function they can use to open the loose object at a given path. The implementation unfortunately ends up repeating bits of related functions, but there's not a good way around it without some major refactoring of the whole sha1_file stack. We need to mmap the specific file, then partially read the zlib stream to know whether we're streaming or not, and then finally either stream it or copy the data to a buffer. We can do that by assembling some of the more arcane internal sha1_file functions, but we end up having to essentially reimplement unpack_sha1_file(), along with the streaming bits of check_sha1_signature(). Still, most of the ugliness is contained in the new function, and the interface is clean enough that it may be reusable (though it seems unlikely anything but git-fsck would care about opening a specific file). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-15sha1_file: fix error message for alternate objectsJeff King
When we fail to open a corrupt loose object, we report an error and mention the filename via sha1_file_name(). However, that function will always give us a path in the local repository, whereas the corrupt object may have come from an alternate. The result is a very misleading error message. Teach the open_sha1_file() and stat_sha1_file() helpers to pass back the path they found, so that we can report it correctly. Note that the pointers we return go to static storage (e.g., from sha1_file_name()), which is slightly dangerous. However, these helpers are static local helpers, and the names are used for immediately generating error messages. The simplicity is an acceptable tradeoff for the danger. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-10Merge branch 'jc/git-open-cloexec'Junio C Hamano
The codeflow of setting NOATIME and CLOEXEC on file descriptors Git opens has been simplified. We may want to drop the tip one, but we'll see. * jc/git-open-cloexec: sha1_file: stop opening files with O_NOATIME git_open_cloexec(): use fcntl(2) w/ FD_CLOEXEC fallback git_open(): untangle possible NOATIME and CLOEXEC interactions
2016-12-21Merge branch 'jk/quote-env-path-list-component'Junio C Hamano
A recent update to receive-pack to make it easier to drop garbage objects made it clear that GIT_ALTERNATE_OBJECT_DIRECTORIES cannot have a pathname with a colon in it (no surprise!), and this in turn made it impossible to push into a repository at such a path. This has been fixed by introducing a quoting mechanism used when appending such a path to the colon-separated list. * jk/quote-env-path-list-component: t5615-alternate-env: double-quotes in file names do not work on Windows t5547-push-quarantine: run the path separator test on Windows, too tmp-objdir: quote paths we add to alternates alternates: accept double-quoted paths
2016-12-12real_path: have callers use real_pathdup and strbuf_realpathBrandon Williams
Migrate callers of real_path() who duplicate the retern value to use real_pathdup or strbuf_realpath. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-12-12alternates: accept double-quoted pathsJeff King
We read lists of alternates from objects/info/alternates files (delimited by newline), as well as from the GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable (delimited by colon or semi-colon, depending on the platform). There's no mechanism for quoting the delimiters, so it's impossible to specify an alternate path that contains a colon in the environment, or one that contains a newline in a file. We've lived with that restriction for ages because both alternates and filenames with colons are relatively rare, and it's only a problem when the two meet. But since 722ff7f87 (receive-pack: quarantine objects until pre-receive accepts, 2016-10-03), which builds on the alternates system, every push causes the receiver to set GIT_ALTERNATE_OBJECT_DIRECTORIES internally. It would be convenient to have some way to quote the delimiter so that we can represent arbitrary paths. The simplest thing would be an escape character before a quoted delimiter (e.g., "\:" as a literal colon). But that creates a backwards compatibility problem: any path which uses that escape character is now broken, and we've just shifted the problem. We could choose an unlikely escape character (e.g., something from the non-printable ASCII range), but that's awkward to use. Instead, let's treat names as unquoted unless they begin with a double-quote, in which case they are interpreted via our usual C-stylke quoting rules. This also breaks backwards-compatibility, but in a smaller way: it only matters if your file has a double-quote as the very _first_ character in the path (whereas an escape character is a problem anywhere in the path). It's also consistent with many other parts of git, which accept either a bare pathname or a double-quoted one, and the sender can choose to quote or not as required. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-11-10Merge branch 'jk/alt-odb-cleanup'Junio C Hamano
Fix a corner-case regression in a topic that graduated during the v2.11 cycle. * jk/alt-odb-cleanup: alternates: re-allow relative paths from environment
2016-11-08alternates: re-allow relative paths from environmentJeff King
Commit 670c359da (link_alt_odb_entry: handle normalize_path errors, 2016-10-03) regressed the handling of relative paths in the GIT_ALTERNATE_OBJECT_DIRECTORIES variable. It's not entirely clear this was ever meant to work, but it _has_ worked for several years, so this commit restores the original behavior. When we get a path in GIT_ALTERNATE_OBJECT_DIRECTORIES, we add it the path to the list of alternate object directories as if it were found in objects/info/alternates, but with one difference: we do not provide the link_alt_odb_entry() function with a base for relative paths. That function doesn't turn it into an absolute path, and we end up feeding the relative path to the strbuf_normalize_path() function. Most relative paths break out of the top-level directory (e.g., "../foo.git/objects"), and thus normalizing fails. Prior to 670c359da, we simply ignored the error, and due to the way normalize_path_copy() was implemented it happened to return the original path in this case. We then accessed the alternate objects using this relative path. By storing the relative path in the alt_odb list, the path is relative to wherever we happen to be at the time we do an object lookup. That means we look from $GIT_DIR in a bare repository, and from the top of the worktree in a non-bare repository. If this were being designed from scratch, it would make sense to pick a stable location (probably $GIT_DIR, or even the object directory) and use that as the relative base, turning the result into an absolute path. However, given the history, at this point the minimal fix is to match the pre-670c359da behavior. We can do this simply by ignoring the error when we have no relative base and using the original value (which we now reliably have, thanks to strbuf_normalize_path()). That still leaves us with a relative path that foils our duplicate detection, and may act strangely if we ever chdir() later in the process. We could solve that by storing an absolute path based on getcwd(). That may be a good future direction; for now we'll do just the minimum to fix the regression. The new t5615 script demonstrates the fix in its final three tests. Since we didn't have any tests of the alternates environment variable at all, it also adds some tests of absolute paths. Reported-by: Bryan Turner <bturner@atlassian.com> Signed-off-by: Jeff King <peff@peff.net>
2016-11-03sha1_file: stop opening files with O_NOATIMEJunio C Hamano
When we open object files, we try to do so with O_NOATIME. This dates back to 144bde78e9 (Use O_NOATIME when opening the sha1 files., 2005-04-23), which is an optimization to avoid creating a bunch of dirty inodes when we're accessing many objects. But a few things have changed since then: 1. In June 2005, git learned about packfiles, which means we would do a lot fewer atime updates (rather than one per object access, we'd generally get one per packfile). 2. In late 2006, Linux learned about "relatime", which is generally the default on modern installs. So performance around atimes updates is a non-issue there these days. All the world isn't Linux, but as it turns out, Linux is the only platform to implement O_NOATIME in the first place. So it's very unlikely that this code is helping anybody these days. Helped-by: Jeff King <peff@peff.net> [jc: took idea and log message from peff] Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-11-03git_open_cloexec(): use fcntl(2) w/ FD_CLOEXEC fallbackJunio C Hamano
A platform might not support open(2) with O_CLOEXEC but may support telling the same with fcntl(2) to flip FD_CLOEXEC bit on on an open file descriptor. It is a fallback that is inherently racy and this may not be worth doing, though. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-31Merge branch 'ls/git-open-cloexec'Junio C Hamano
Git generally does not explicitly close file descriptors that were open in the parent process when spawning a child process, but most of the time the child does not want to access them. As Windows does not allow removing or renaming a file that has a file descriptor open, a slow-to-exit child can even break the parent process by holding onto them. Use O_CLOEXEC flag to open files in various codepaths. * ls/git-open-cloexec: read-cache: make sure file handles are not inherited by child processes sha1_file: open window into packfiles with O_CLOEXEC sha1_file: rename git_open_noatime() to git_open()
2016-10-28git_open(): untangle possible NOATIME and CLOEXEC interactionsJunio C Hamano
The way we structured the fallback/retry mechanism for opening with O_NOATIME and O_CLOEXEC meant that if we failed due to lack of support to open the file with O_NOATIME option (i.e. EINVAL), we would still try to drop O_CLOEXEC first and retry, and then drop O_NOATIME. A platform on which O_NOATIME is defined in the header without support from the kernel wouldn't have a chance to open with O_CLOEXEC option due to this code structure. Arguably, O_CLOEXEC is more important than O_NOATIME, as the latter is mostly about performance, while the former can affect correctness. Instead use O_CLOEXEC to open the file, and then use fcntl(2) to set O_NOATIME on the resulting file descriptor. open(2) itself does not cause atime to be updated according to Linus [*1*]. The helper to do the former can be usable in the codepath in ce_compare_data() that was recently added to open a file descriptor with O_CLOEXEC; use it while we are at it. *1* <CA+55aFw83E+zOd+z5h-CA-3NhrLjVr-anL6pubrSWttYx3zu8g@mail.gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-27Merge branch 'jk/abbrev-auto'Junio C Hamano
Updates the way approximate count of total objects is computed while attempting to come up with a unique abbreviated object name, which in turn needs to estimate how many hexdigits are necessary to ensure uniqueness. * jk/abbrev-auto: find_unique_abbrev: move logic out of get_short_sha1()
2016-10-26Merge branch 'jk/fetch-quick-tag-following'Junio C Hamano
When fetching from a remote that has many tags that are irrelevant to branches we are following, we used to waste way too many cycles when checking if the object pointed at by a tag (that we are not going to fetch!) exists in our repository too carefully. * jk/fetch-quick-tag-following: fetch: use "quick" has_sha1_file for tag following
2016-10-25sha1_file: open window into packfiles with O_CLOEXECLars Schneider
All processes that the Git main process spawns inherit the open file descriptors of the main process. These leaked file descriptors can cause problems. Use the O_CLOEXEC flag similar to 05d1ed61 to fix the leaked file descriptors. Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-25sha1_file: rename git_open_noatime() to git_open()Lars Schneider
This function is meant to be used when reading from files in the object store, and the original objective was to avoid smudging atime of loose object files too often, hence its name. Because we'll be extending its role in the next commit to also arrange the file descriptors they return auto-closed in the child processes, rename it to lose "noatime" part that is too specific. Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-17Merge branch 'jk/alt-odb-cleanup'Junio C Hamano
Codepaths involved in interacting alternate object store have been cleaned up. * jk/alt-odb-cleanup: alternates: use fspathcmp to detect duplicates sha1_file: always allow relative paths to alternates count-objects: report alternates via verbose mode fill_sha1_file: write into a strbuf alternates: store scratch buffer as strbuf fill_sha1_file: write "boring" characters alternates: use a separate scratch space alternates: encapsulate alt->base munging alternates: provide helper for allocating alternate alternates: provide helper for adding to alternates list link_alt_odb_entry: refactor string handling link_alt_odb_entry: handle normalize_path errors t5613: clarify "too deep" recursion tests t5613: do not chdir in main process t5613: whitespace/style cleanups t5613: use test_must_fail t5613: drop test_valid_repo function t5613: drop reachable_via function
2016-10-14fetch: use "quick" has_sha1_file for tag followingJeff King
When we auto-follow tags in a fetch, we look at all of the tags advertised by the remote and fetch ones where we don't already have the tag, but we do have the object it peels to. This involves a lot of calls to has_sha1_file(), some of which we can reasonably expect to fail. Since 45e8a74 (has_sha1_file: re-check pack directory before giving up, 2013-08-30), this may cause many calls to reprepare_packed_git(), which is potentially expensive. This has gone unnoticed for several years because it requires a fairly unique setup to matter: 1. You need to have a lot of packs on the client side to make reprepare_packed_git() expensive (the most expensive part is finding duplicates in an unsorted list, which is currently quadratic). 2. You need a large number of tag refs on the server side that are candidates for auto-following (i.e., that the client doesn't have). Each one triggers a re-read of the pack directory. 3. Under normal circumstances, the client would auto-follow those tags and after one large fetch, (2) would no longer be true. But if those tags point to history which is disconnected from what the client otherwise fetches, then it will never auto-follow, and those candidates will impact it on every fetch. So when all three are true, each fetch pays an extra O(nr_tags * nr_packs^2) cost, mostly in string comparisons on the pack names. This was exacerbated by 47bf4b0 (prepare_packed_git_one: refactor duplicate-pack check, 2014-06-30) which uses a slightly more expensive string check, under the assumption that the duplicate check doesn't happen very often (and it shouldn't; the real problem here is how often we are calling reprepare_packed_git()). This patch teaches fetch to use HAS_SHA1_QUICK to sacrifice accuracy for speed, in cases where we might be racy with a simultaneous repack. This is similar to the fix in 0eeb077 (index-pack: avoid excessive re-reading of pack directory, 2015-06-09). As with that case, it's OK for has_sha1_file() occasionally say "no I don't have it" when we do, because the worst case is not a corruption, but simply that we may fail to auto-follow a tag that points to it. Here are results from the included perf script, which sets up a situation similar to the one described above: Test HEAD^ HEAD ---------------------------------------------------------- 5550.4: fetch 11.21(10.42+0.78) 0.08(0.04+0.02) -99.3% Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10Merge branch 'jk/pack-objects-optim-mru'Junio C Hamano
"git pack-objects" in a repository with many packfiles used to spend a lot of time looking for/at objects in them; the accesses to the packfiles are now optimized by checking the most-recently-used packfile first. * jk/pack-objects-optim-mru: pack-objects: use mru list when iterating over packs pack-objects: break delta cycles before delta-search phase sha1_file: make packed_object_info public provide an initializer for "struct object_info"
2016-10-10alternates: use fspathcmp to detect duplicatesJeff King
On a case-insensitive filesystem, we should realize that "a/objects" and "A/objects" are the same path. We already use fspathcmp() to check against the main object directory, but until recently we couldn't use it for comparing against other alternates (because their paths were not NUL-terminated strings). But now we can, so let's do so. Note that we also need to adjust count-objects to load the config, so that it can see the setting of core.ignorecase (this is required by the test, but is also a general bugfix for users of count-objects). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10sha1_file: always allow relative paths to alternatesJeff King
We recursively expand alternates repositories, so that if A borrows from B which borrows from C, A can see all objects. For the root object database, we allow relative paths, so A can point to B as "../B/objects". However, we currently do not allow relative paths when recursing, so B must use an absolute path to reach C. That is an ancient protection from c2f493a (Transitively read alternatives, 2006-05-07) that tries to avoid adding the same alternate through two different paths. Since 5bdf0a8 (sha1_file: normalize alt_odb path before comparing and storing, 2011-09-07), we use a normalized absolute path for each alt_odb entry. This means that in most cases the protection is no longer necessary; we will detect the duplicate no matter how we got there (but see below). And it's a good idea to get rid of it, as it creates an unnecessary complication when setting up recursive alternates (B has to know that A is going to borrow from it and make sure to use an absolute path). Note that our normalization doesn't actually look at the filesystem, so it can still be fooled by crossing symbolic links. But that's also true of absolute paths, so it's not a good reason to disallow only relative paths (it's potentially a reason to switch to real_path(), but that's a separate and non-trivial change). We adjust the test script here to demonstrate that this now works, and add new tests to show that the normalization does indeed suppress duplicates. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10fill_sha1_file: write into a strbufJeff King
It's currently the responsibility of the caller to give fill_sha1_file() enough bytes to write into, leading them to manually compute the required lengths. Instead, let's just write into a strbuf so that it's impossible to get this wrong. The alt_odb caller already has a strbuf, so this makes things strictly simpler. The other caller, sha1_file_name(), uses a static PATH_MAX buffer and dies when it would overflow. We can convert this to a static strbuf, which means our allocation cost is amortized (and as a bonus, we no longer have to worry about PATH_MAX being too short for normal use). This does introduce some small overhead in fill_sha1_file(), as each strbuf_addchar() will check whether it needs to grow. However, between the optimization in fec501d (strbuf_addch: avoid calling strbuf_grow, 2015-04-16) and the fact that this is not generally called in a tight loop (after all, the next step is typically to access the file!) this probably doesn't matter. And even if it did, the right place to micro-optimize is inside fill_sha1_file(), by calling a single strbuf_grow() there. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10alternates: store scratch buffer as strbufJeff King
We pre-size the scratch buffer to hold a loose object filename of the form "xx/yyyy...", which leads to allocation code that is hard to verify. We have to use some magic numbers during the initial allocation, and then writers must blindly assume that the buffer is big enough. Using a strbuf makes it more clear that we cannot overflow. Unfortunately, we do still need some magic numbers to grow our strbuf before calling fill_sha1_path(), but the strbuf growth is much closer to the point of use. This makes it easier to see that it's correct, and opens the possibility of pushing it even further down if fill_sha1_path() learns to work on strbufs. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10fill_sha1_file: write "boring" charactersJeff King
This function forms a sha1 as "xx/yyyy...", but skips over the slot for the slash rather than writing it, leaving it to the caller to do so. It also does not bother to put in a trailing NUL, even though every caller would want it (we're forming a path which by definition is not a directory, so the only thing to do with it is feed it to a system call). Let's make the lives of our callers easier by just writing out the internal "/" and the NUL. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10alternates: use a separate scratch spaceJeff King
The alternate_object_database struct uses a single buffer both for storing the path to the alternate, and as a scratch buffer for forming object names. This is efficient (since otherwise we'd end up storing the path twice), but it makes life hard for callers who just want to know the path to the alternate. They have to remember to stop reading after "alt->name - alt->base" bytes, and to subtract one for the trailing '/'. It would be much simpler if they could simply access a NUL-terminated path string. We could encapsulate this in a function which puts a NUL in the scratch buffer and returns the string, but that opens up questions about the lifetime of the result. The first time another caller uses the alternate, the scratch buffer may get other data tacked onto it. Let's instead just store the root path separately from the scratch buffer. There aren't enough alternates being stored for the duplicated data to matter for performance, and this keeps things simple and safe for the callers. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10alternates: encapsulate alt->base mungingJeff King
The alternate_object_database struct holds a path to the alternate objects, but we also use that buffer as scratch space for forming loose object filenames. Let's pull that logic into a helper function so that we can more easily modify it. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10alternates: provide helper for allocating alternateJeff King
Allocating a struct alternate_object_database is tricky, as we must over-allocate the buffer to provide scratch space, and then put in particular '/' and NUL markers. Let's encapsulate this in a function so that the complexity doesn't leak into callers (and so that we can modify it later). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10alternates: provide helper for adding to alternates listJeff King
The submodule code wants to temporarily add an alternate object store to our in-memory alt_odb list, but does it manually. Let's provide a helper so it can reuse the code in link_alt_odb_entry(). While we're adding our new add_to_alternates_memory(), let's document add_to_alternates_file(), as the two are related. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10link_alt_odb_entry: refactor string handlingJeff King
The string handling in link_alt_odb_entry() is mostly an artifact of the original version, which took the path as a ptr/len combo, and did not have a NUL-terminated string until we created one in the alternate_object_database struct. But since 5bdf0a8 (sha1_file: normalize alt_odb path before comparing and storing, 2011-09-07), the first thing we do is put the path into a strbuf, which gives us some easy opportunities for cleanup. In particular: - we call strlen(pathbuf.buf), which is silly; we can look at pathbuf.len. - even though we have a strbuf, we don't maintain its "len" field when chomping extra slashes from the end, and instead keep a separate "pfxlen" variable. We can fix this and then drop "pfxlen" entirely. - we don't check whether the path is usable until after we allocate the new struct, making extra cleanup work for ourselves. Since we have a NUL-terminated string, we can bump the "is it usable" checks higher in the function. While we're at it, we can move that logic to its own helper, which makes the flow of link_alt_odb_entry() easier to follow. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10link_alt_odb_entry: handle normalize_path errorsJeff King
When we add a new alternate to the list, we try to normalize out any redundant "..", etc. However, we do not look at the return value of normalize_path_copy(), and will happily continue with a path that could not be normalized. Worse, the normalizing process is done in-place, so we are left with whatever half-finished working state the normalizing function was in. Fortunately, this cannot cause us to read past the end of our buffer, as that working state will always leave the NUL from the original path in place. And we do tend to notice problems when we check is_directory() on the path. But you can see the nonsense that we feed to is_directory with an entry like: this/../../is/../../way/../../too/../../deep/../../to/../../resolve in your objects/info/alternates, which yields: error: object directory /to/e/deep/too/way//ects/this/../../is/../../way/../../too/../../deep/../../to/../../resolve does not exist; check .git/objects/info/alternates. We can easily fix this just by checking the return value. But that makes it hard to generate a good error message, since we're normalizing in-place and our input value has been overwritten by cruft. Instead, let's provide a strbuf helper that does an in-place normalize, but restores the original contents on error. This uses a second buffer under the hood, which is slightly less efficient, but this is not a performance-critical code path. The strbuf helper can also properly set the "len" parameter of the strbuf before returning. Just doing: normalize_path_copy(buf.buf, buf.buf); will shorten the string, but leave buf.len at the original length. That may be confusing to later code which uses the strbuf. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-04find_unique_abbrev: move logic out of get_short_sha1()Jeff King
The get_short_sha1() is only about reading short sha1s; we do call it in a loop to check "is this long enough" for each object, but otherwise it should not need to know about things like our default_abbrev setting. So instead of asking it to set default_automatic_abbrev as a side-effect, let's just have find_unique_abbrev() pick the right place to start its loop. This requires a separate approximate_object_count() function, but that naturally belongs with the rest of sha1_file.c. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-03Merge branch 'jc/verify-loose-object-header'Junio C Hamano
Codepaths that read from an on-disk loose object were too loose in validating what they are reading is a proper object file and sometimes read past the data they read from the disk, which has been corrected. H/t to Gustavo Grieco for reporting. * jc/verify-loose-object-header: unpack_sha1_header(): detect malformed object header streaming: make sure to notice corrupt object
2016-09-26unpack_sha1_header(): detect malformed object headerJunio C Hamano
When opening a loose object file, we often do this sequence: - prepare a short buffer for the object header (on stack) - call unpack_sha1_header() and have early part of the object data inflated, enough to fill the buffer - parse that data in the short buffer, assuming that the first part of the object is <typename> SP <length> NUL Because the parsing function parse_sha1_header_extended() is not given the number of bytes inflated into the header buffer, it you craft a file whose early part inflates a garbage sequence without SP or NUL, and replace a loose object with it, it will end up reading past the end of the inflated data. To correct this, do the following four things: - rename unpack_sha1_header() to unpack_sha1_short_header() and have unpack_sha1_header_to_strbuf() keep calling that as its helper function. This will detect and report zlib errors, but is not aware of the format of a loose object (as before). - introduce unpack_sha1_header() that calls the same helper function, and when zlib reports it inflated OK into the buffer, check if the inflated data has NUL. This would ensure that parsing function will terminate within the buffer that holds the inflated header. - update unpack_sha1_header_to_strbuf() to check if the resulting buffer has NUL for the same effect. - update parse_sha1_header_extended() to make sure that its loop to find the SP that terminates the <typename> stops at NUL. Essentially, this makes unpack_*() functions that are asked to unpack a loose object header to be a bit more strict and detect an input that cannot possibly be a valid object header, even before the parsing function kicks in. Reported-by: Gustavo Grieco <gustavo.grieco@imag.fr> Helped-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-21Merge branch 'rs/pack-sort-with-llist-mergesort'Junio C Hamano
Code cleanup. * rs/pack-sort-with-llist-mergesort: sha1_file: use llist_mergesort() for sorting packs
2016-09-21Merge branch 'jk/delta-base-cache'Junio C Hamano
Recently we updated the code to manage the in-core cache that holds objects that have recently been used to reconstitute other objects that are stored as deltas against them, but the update used an incorrect API function to manage the list of these objects. This has been fixed. * jk/delta-base-cache: add_delta_base_cache: use list_for_each_safe
2016-09-13sha1_file: use llist_mergesort() for sorting packsRené Scharfe
Sort the linked list of packs directly using llist_mergesort() instead of building an array, calling qsort(3) and fixing up the list pointers. This is shorter and less complicated. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-12Merge branch 'jk/diff-submodule-diff-inline'Junio C Hamano
The "git diff --submodule={short,log}" mechanism has been enhanced to allow "--submodule=diff" to show the patch between the submodule commits bound to the superproject. * jk/diff-submodule-diff-inline: diff: teach diff to display submodule difference with an inline diff submodule: refactor show_submodule_summary with helper function submodule: convert show_submodule_summary to use struct object_id * allow do_submodule_path to work even if submodule isn't checked out diff: prepare for additional submodule formats graph: add support for --line-prefix on all graph-aware output diff.c: remove output_prefix_length field cache: add empty_tree_oid object and helper function
2016-09-12add_delta_base_cache: use list_for_each_safeJeff King
We may remove elements from the list while we are iterating, which requires using a second temporary pointer. Otherwise stepping to the next element of the list might involve looking at freed memory (which generally works in practice, as we _just_ freed it, but of course is wrong to rely on; valgrind notices it). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-09Merge branch 'sb/submodule-clone-rr'Junio C Hamano
"git clone --resurse-submodules --reference $path $URL" is a way to reduce network transfer cost by borrowing objects in an existing $path repository when cloning the superproject from $URL; it learned to also peek into $path for presense of corresponding repositories of submodules and borrow objects from there when able. * sb/submodule-clone-rr: clone: recursive and reference option triggers submodule alternates clone: implement optional references clone: clarify option_reference as required clone: factor out checking for an alternate path submodule--helper update-clone: allow multiple references submodule--helper module-clone: allow multiple references t7408: merge short tests, factor out testing method t7408: modernize style
2016-09-01cache: add empty_tree_oid object and helper functionJacob Keller
Similar to is_null_oid(), and is_empty_blob_sha1() add an empty_tree_oid along with helper function is_empty_tree_oid(). For completeness, also add an "is_empty_tree_sha1()", "is_empty_blob_sha1()", "is_empty_tree_oid()" and "is_empty_blob_oid()" helpers. To ensure we only get one singleton, implement EMPTY_BLOB_SHA1_BIN as simply getting the hash of empty_blob_oid structure. Signed-off-by: Jacob Keller <jacob.keller@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-23delta_base_cache: use hashmap.hJeff King
The fundamental data structure of the delta base cache is a hash table mapping pairs of "(packfile, offset)" into structs containing the actual object data. The hash table implementation dates back to e5e0161 (Implement a simple delta_base cache, 2007-03-17), and uses a fixed-size table. The current size is a hard-coded 256 entries. Because we need to be able to remove objects from the hash table, entry lookup does not do any kind of probing to handle collisions. Colliding items simply replace whatever is in their slot. As a result, we have fewer usable slots than even the 256 we allocate. At half full, each new item has a 50% chance of displacing another one. Or another way to think about it: every item has a 1/256 chance of being ejected due to hash collision, without regard to our LRU strategy. So it would be interesting to see the effect of increasing the cache size on the runtime for some common operations. As with the previous patch, we'll measure "git log --raw" for tree-only operations, and "git log -Sfoo --raw" for operations that touch trees and blobs. All times are wall-clock best-of-3, done against fully packed repos with --depth=50, and the default core.deltaBaseCacheLimit of 96MB. Here are timings for various values of MAX_DELTA_CACHE against git.git (the asterisk marks the minimum time for each operation): MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m02.227s 0m12.821s 512 0m02.143s 0m10.602s 1024 0m02.127s 0m08.642s 2048 0m02.148s 0m07.123s 4096 0m02.194s 0m06.448s* 8192 0m02.239s 0m06.504s 16384 0m02.144s* 0m06.502s 32768 0m02.202s 0m06.622s 65536 0m02.230s 0m06.677s The log-raw case isn't changed much at all here (probably because our trees just aren't that big in the first place, or possibly because we have so _few_ trees in git.git that the 256-entry cache is enough). But once we start putting blobs in the cache, too, we see a big improvement (almost 50%). The curve levels off around 4096, which means that we can hold about that many entries before hitting the 96MB memory limit (or possibly that the workload is small enough that there is simply no more work to be optimized out by caching more). (As a side note, I initially timed my existing git.git pack, which was a base of --aggressive combined with some pulls on top. So it had quite a few deeper delta chains. The 256-cache case was more like 15s, and it still dropped to ~6.5s in the same way). Here are the timings for linux.git: MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m41.661s 5m12.410s 512 0m39.547s 5m07.920s 1024 0m37.054s 4m54.666s 2048 0m35.871s 4m41.194s* 4096 0m34.646s 4m51.648s 8192 0m33.881s 4m55.342s 16384 0m35.190s 5m00.122s 32768 0m35.060s 4m58.851s 65536 0m33.311s* 4m51.420s As we grow we see a nice 20% speedup in the tree traversal, and more modest 10% in the log-S. This is probably an indication that we are bound less by the number of entries, and more by the memory limit (more on that below). What is interesting is that the numbers bounce around a bit; increasing the number of entries isn't always a strict improvement. Partially this is due to noise in the measurement. But it may also be an indication that our LRU ejection scheme is not optimal. The smaller cache sizes introduce some randomness into the ejection (due to collisions), which may sometimes work in our favor (and sometimes not!). So what is the optimal setting of MAX_DELTA_CACHE? The "bouncing" in the linux.git log-S numbers notwithstanding, it mostly seems like bigger is better. And even if we were to try to find a "sweet spot", these are just two repositories, that are not necessarily representative. The shape of history, the size of trees and blobs, the memory limit configuration, etc, all will affect the outcome. Rather than trying to find the "right" number, another strategy is to just switch to a hash table that can actually store collisions: namely our hashmap.h implementation. Here are numbers for that compared to the "best" we saw from adjusting MAX_DELTA_CACHE: | log-raw | log-S | best hashmap | best hashmap | --------- --------- | --------- --------- git | 0m02.144s 0m02.144s | 0m06.448s 0m06.688s linux | 0m33.311s 0m33.092s | 4m41.194s 4m57.172s We can see the results are similar in most cases, which is what we'd expect. We're not ejecting due to collisions at all, so this is purely representing the LRU. So really, we'd expect this to model most closely the larger values of the static MAX_DELTA_CACHE limit. And that does seem to be what's happening, including the "bounce" in the linux log-S case. So while the value for that case _isn't_ as good as the optimal one measured above (which was 2048 entries), given the bouncing I'm hesitant to suggest that 2048 is any kind of optimum (not even for linux.git, let alone as a general rule). The generic hashmap has the appeal that it drops the number of tweakable numbers by one, which means we can focus on tuning other elements, like the LRU strategy or the core.deltaBaseCacheLimit setting. And indeed, if we bump the cache limit to 1G (which is probably silly for general use, but maybe something people with big workstations would want to do), the linux.git log-S time drops to 3m32s. That's something you really _can't_ do easily with the static hash table, because the number of entries needs to grow in proportion to the memory limit (so 2048 is almost certainly not going to be the right value there). This patch takes that direction, and drops the static hash table entirely in favor of using the hashmap.h API. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-23delta_base_cache: drop special treatment of blobsJeff King
When the delta base cache runs out of allowed memory, it has to drop entries. It does so by walking an LRU list, dropping objects until we are under the memory limit. But we actually walk the list twice: once to drop blobs, and then again to drop other objects (which are generally trees). This comes from 18bdec1 (Limit the size of the new delta_base_cache, 2007-03-19). This performs poorly as the number of entries grows, because any time dropping blobs does not satisfy the limit, we have to walk the _entire_ list, trees included, looking for blobs to drop, before starting to drop any trees. It's not generally a problem now, as the cache is limited to only 256 entries. But as we could benefit from increasing that in a future patch, it's worth looking at how it performs as the cache size grows. And the answer is "not well". The table below shows times for various operations with different values of MAX_DELTA_CACHE (which is not a run-time knob; I recompiled with -DMAX_DELTA_CACHE=$n for each). I chose "git log --raw" ("log-raw" in the table) because it will access all of the trees, but no blobs at all (so in a sense it is a worst case for this problem, because we will always walk over the entire list of trees once before realizing there are no blobs to drop). This is also representative of other tree-only operations like "rev-list --objects" and "git log -- <path>". I also timed "git log -Sfoo --raw" ("log-S" in the table). It similarly accesses all of the trees, but also the blobs for each commit. It's representative of "git log -p", though it emphasizes the cost of blob access more, as "-S" is cheaper than computing an actual blob diff. All timings are best-of-3 wall-clock times (though they all were CPU bound, so the user CPU times are similar). The repositories were fully packed with --depth=50, and the default core.deltaBaseCacheLimit of 96M was in effect. The current value of MAX_DELTA_CACHE is 256, so I started there and worked up by factors of 2. First, here are values for git.git (the asterisk signals the fastest run for each operation): MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m02.212s 0m12.634s 512 0m02.136s* 0m10.614s 1024 0m02.156s 0m08.614s 2048 0m02.208s 0m07.062s 4096 0m02.190s 0m06.484s* 8192 0m02.176s 0m07.635s 16384 0m02.913s 0m19.845s 32768 0m03.617s 1m05.507s 65536 0m04.031s 1m18.488s You can see that for the tree-only log-raw case, we don't actually benefit that much as the cache grows (all the differences up through 8192 are basically just noise; this is probably because we don't actually have that many distinct trees in git.git). But for log-S, we get a definite speed improvement as the cache grows, but the improvements are lost as cache size grows and the linear LRU management starts to dominate. Here's the same thing run against linux.git: MAX_DELTA_CACHE log-raw log-S --------------- --------- ---------- 256 0m40.987s 5m13.216s 512 0m37.949s 5m03.243s 1024 0m35.977s 4m50.580s 2048 0m33.855s 4m39.818s 4096 0m32.913s 4m47.299s* 8192 0m32.176s* 5m14.650s 16384 0m32.185s 6m31.625s 32768 0m38.056s 9m31.136s 65536 1m30.518s 17m38.549s The pattern is similar, though the effect in log-raw is more pronounced here. The times dip down in the middle, and then go back up as we keep growing. So we know there's a problem. What's the solution? The obvious one is to improve the data structure to avoid walking over tree entries during the looking-for-blobs traversal. We can do this by keeping _two_ LRU lists: one for blobs, and one for other objects. We drop items from the blob LRU first, and then from the tree LRU (if necessary). Here's git.git using that strategy: MAX_DELTA_CACHE log-raw log-S --------------- --------- ---------- 256 0m02.264s 0m12.830s 512 0m02.201s 0m10.771s 1024 0m02.181s 0m08.593s 2048 0m02.205s 0m07.116s 4096 0m02.158s 0m06.537s* 8192 0m02.213s 0m07.246s 16384 0m02.155s* 0m10.975s 32768 0m02.159s 0m16.047s 65536 0m02.181s 0m16.992s The upswing on log-raw is gone completely. But log-S still has it (albeit much better than without this strategy). Let's see what linux.git shows: MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m42.519s 5m14.654s 512 0m39.106s 5m04.708s 1024 0m36.802s 4m51.454s 2048 0m34.685s 4m39.378s* 4096 0m33.663s 4m44.047s 8192 0m33.157s 4m50.644s 16384 0m33.090s* 4m49.648s 32768 0m33.458s 4m53.371s 65536 0m33.563s 5m04.580s The results are similar. The tree-only case again performs well (not surprising; we're literally just dropping the one useless walk, and not otherwise changing the cache eviction strategy at all). But the log-S case again does a bit worse as the cache grows (though possibly that's within the noise, which is much larger for this case). Perhaps this is an indication that the "remove blobs first" strategy is not actually optimal. The intent of it is to avoid blowing out the tree cache when we see large blobs, but it also means we'll throw away useful, recent blobs in favor of older trees. Let's run the same numbers without caring about object type at all (i.e., one LRU list, and always evicting whatever is at the head, regardless of type). Here's git.git: MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m02.227s 0m12.821s 512 0m02.143s 0m10.602s 1024 0m02.127s 0m08.642s 2048 0m02.148s 0m07.123s 4096 0m02.194s 0m06.448s* 8192 0m02.239s 0m06.504s 16384 0m02.144s* 0m06.502s 32768 0m02.202s 0m06.622s 65536 0m02.230s 0m06.677s Much smoother; there's no dramatic upswing as we increase the cache size (some remains, though it's small enough that it's mostly run-to-run noise. E.g., in the log-raw case, note how 8192 is 50-100ms higher than its neighbors). Note also that we stop getting any real benefit for log-S after about 4096 entries; that number will depend on the size of the repository, the size of the blob entries, and the memory limit of the cache. Let's see what linux.git shows for the same strategy: MAX_DELTA_CACHE log-raw log-S --------------- --------- --------- 256 0m41.661s 5m12.410s 512 0m39.547s 5m07.920s 1024 0m37.054s 4m54.666s 2048 0m35.871s 4m41.194s* 4096 0m34.646s 4m51.648s 8192 0m33.881s 4m55.342s 16384 0m35.190s 5m00.122s 32768 0m35.060s 4m58.851s 65536 0m33.311s* 4m51.420s It's similarly good. As with the "separate blob LRU" strategy, there's a lot of noise on the log-S run here. But it's certainly not any worse, is possibly a bit better, and the improvement over "separate blob LRU" on the git.git case is dramatic. So it seems like a clear winner, and that's what this patch implements. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-08-23delta_base_cache: use list.h for LRUJeff King
We keep an LRU list of entries for when we need to drop something from an over-full cache. The list is implemented as a circular doubly-linked list, which is exactly what list.h provides. We can save a few lines by using the list.h macros and functions. More importantly, this makes the code easier to follow, as the reader sees explicit concepts like "list_add_tail()" instead of pointer manipulation. As a bonus, the list_entry() macro lets us place the lru pointers anywhere inside the delta_base_cache_entry struct (as opposed to just casting the pointer, which requires it at the front of the struct). This will be useful in later patches when we need to place other items at the front of the struct (e.g., our hashmap implementation requires this). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>