summaryrefslogtreecommitdiff
path: root/Documentation/technical
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/technical')
-rw-r--r--Documentation/technical/api-allocation-growing.txt39
-rw-r--r--Documentation/technical/api-argv-array.txt65
-rw-r--r--Documentation/technical/api-config.txt319
-rw-r--r--Documentation/technical/api-credentials.txt271
-rw-r--r--Documentation/technical/api-diff.txt174
-rw-r--r--Documentation/technical/api-directory-listing.txt130
-rw-r--r--Documentation/technical/api-error-handling.txt32
-rw-r--r--Documentation/technical/api-gitattributes.txt154
-rw-r--r--Documentation/technical/api-grep.txt8
-rw-r--r--Documentation/technical/api-history-graph.txt173
-rw-r--r--Documentation/technical/api-index-skel.txt2
-rw-r--r--Documentation/technical/api-merge.txt72
-rw-r--r--Documentation/technical/api-object-access.txt15
-rw-r--r--Documentation/technical/api-oid-array.txt90
-rw-r--r--Documentation/technical/api-parse-options.txt58
-rw-r--r--Documentation/technical/api-quote.txt10
-rw-r--r--Documentation/technical/api-ref-iteration.txt78
-rw-r--r--Documentation/technical/api-remote.txt127
-rw-r--r--Documentation/technical/api-revision-walking.txt72
-rw-r--r--Documentation/technical/api-run-command.txt264
-rw-r--r--Documentation/technical/api-setup.txt47
-rw-r--r--Documentation/technical/api-sigchain.txt41
-rw-r--r--Documentation/technical/api-simple-ipc.txt105
-rw-r--r--Documentation/technical/api-submodule-config.txt66
-rw-r--r--Documentation/technical/api-trace.txt140
-rw-r--r--Documentation/technical/api-trace2.txt568
-rw-r--r--Documentation/technical/api-tree-walking.txt149
-rw-r--r--Documentation/technical/api-xdiff-interface.txt7
-rw-r--r--Documentation/technical/bitmap-format.txt277
-rw-r--r--Documentation/technical/bundle-uri.txt572
-rw-r--r--Documentation/technical/commit-graph-format.txt104
-rw-r--r--Documentation/technical/commit-graph.txt115
-rw-r--r--Documentation/technical/directory-rename-detection.txt29
-rw-r--r--Documentation/technical/hash-function-transition.txt305
-rw-r--r--Documentation/technical/http-protocol.txt518
-rw-r--r--Documentation/technical/index-format.txt357
-rw-r--r--Documentation/technical/long-running-process-protocol.txt2
-rw-r--r--Documentation/technical/multi-pack-index.txt45
-rw-r--r--Documentation/technical/pack-format.txt331
-rw-r--r--Documentation/technical/pack-protocol.txt674
-rw-r--r--Documentation/technical/packfile-uri.txt82
-rw-r--r--Documentation/technical/parallel-checkout.txt270
-rw-r--r--Documentation/technical/partial-clone.txt46
-rw-r--r--Documentation/technical/protocol-capabilities.txt337
-rw-r--r--Documentation/technical/protocol-common.txt99
-rw-r--r--Documentation/technical/protocol-v2.txt455
-rw-r--r--Documentation/technical/racy-git.txt12
-rw-r--r--Documentation/technical/reftable.txt1098
-rw-r--r--Documentation/technical/remembering-renames.txt671
-rw-r--r--Documentation/technical/repository-version.txt12
-rw-r--r--Documentation/technical/rerere.txt20
-rw-r--r--Documentation/technical/scalar.txt66
-rw-r--r--Documentation/technical/shallow.txt2
-rw-r--r--Documentation/technical/signature-format.txt186
-rw-r--r--Documentation/technical/sparse-checkout.txt1103
-rw-r--r--Documentation/technical/sparse-index.txt208
-rw-r--r--Documentation/technical/unit-tests.txt240
57 files changed, 5253 insertions, 6259 deletions
diff --git a/Documentation/technical/api-allocation-growing.txt b/Documentation/technical/api-allocation-growing.txt
deleted file mode 100644
index 5a59b54..0000000
--- a/Documentation/technical/api-allocation-growing.txt
+++ /dev/null
@@ -1,39 +0,0 @@
-allocation growing API
-======================
-
-Dynamically growing an array using realloc() is error prone and boring.
-
-Define your array with:
-
-* a pointer (`item`) that points at the array, initialized to `NULL`
- (although please name the variable based on its contents, not on its
- type);
-
-* an integer variable (`alloc`) that keeps track of how big the current
- allocation is, initialized to `0`;
-
-* another integer variable (`nr`) to keep track of how many elements the
- array currently has, initialized to `0`.
-
-Then before adding `n`th element to the item, call `ALLOC_GROW(item, n,
-alloc)`. This ensures that the array can hold at least `n` elements by
-calling `realloc(3)` and adjusting `alloc` variable.
-
-------------
-sometype *item;
-size_t nr;
-size_t alloc
-
-for (i = 0; i < nr; i++)
- if (we like item[i] already)
- return;
-
-/* we did not like any existing one, so add one */
-ALLOC_GROW(item, nr + 1, alloc);
-item[nr++] = value you like;
-------------
-
-You are responsible for updating the `nr` variable.
-
-If you need to specify the number of elements to allocate explicitly
-then use the macro `REALLOC_ARRAY(item, alloc)` instead of `ALLOC_GROW`.
diff --git a/Documentation/technical/api-argv-array.txt b/Documentation/technical/api-argv-array.txt
deleted file mode 100644
index 870c8ed..0000000
--- a/Documentation/technical/api-argv-array.txt
+++ /dev/null
@@ -1,65 +0,0 @@
-argv-array API
-==============
-
-The argv-array API allows one to dynamically build and store
-NULL-terminated lists. An argv-array maintains the invariant that the
-`argv` member always points to a non-NULL array, and that the array is
-always NULL-terminated at the element pointed to by `argv[argc]`. This
-makes the result suitable for passing to functions expecting to receive
-argv from main(), or the link:api-run-command.html[run-command API].
-
-The string-list API (documented in string-list.h) is similar, but cannot be
-used for these purposes; instead of storing a straight string pointer,
-it contains an item structure with a `util` field that is not compatible
-with the traditional argv interface.
-
-Each `argv_array` manages its own memory. Any strings pushed into the
-array are duplicated, and all memory is freed by argv_array_clear().
-
-Data Structures
----------------
-
-`struct argv_array`::
-
- A single array. This should be initialized by assignment from
- `ARGV_ARRAY_INIT`, or by calling `argv_array_init`. The `argv`
- member contains the actual array; the `argc` member contains the
- number of elements in the array, not including the terminating
- NULL.
-
-Functions
----------
-
-`argv_array_init`::
- Initialize an array. This is no different than assigning from
- `ARGV_ARRAY_INIT`.
-
-`argv_array_push`::
- Push a copy of a string onto the end of the array.
-
-`argv_array_pushl`::
- Push a list of strings onto the end of the array. The arguments
- should be a list of `const char *` strings, terminated by a NULL
- argument.
-
-`argv_array_pushf`::
- Format a string and push it onto the end of the array. This is a
- convenience wrapper combining `strbuf_addf` and `argv_array_push`.
-
-`argv_array_pushv`::
- Push a null-terminated array of strings onto the end of the array.
-
-`argv_array_pop`::
- Remove the final element from the array. If there are no
- elements in the array, do nothing.
-
-`argv_array_clear`::
- Free all memory associated with the array and return it to the
- initial, empty state.
-
-`argv_array_detach`::
- Disconnect the `argv` member from the `argv_array` struct and
- return it. The caller is responsible for freeing the memory used
- by the array, and by the strings it references. After detaching,
- the `argv_array` is in a reinitialized state and can be pushed
- into again.
diff --git a/Documentation/technical/api-config.txt b/Documentation/technical/api-config.txt
deleted file mode 100644
index 7d20716..0000000
--- a/Documentation/technical/api-config.txt
+++ /dev/null
@@ -1,319 +0,0 @@
-config API
-==========
-
-The config API gives callers a way to access Git configuration files
-(and files which have the same syntax). See linkgit:git-config[1] for a
-discussion of the config file syntax.
-
-General Usage
--------------
-
-Config files are parsed linearly, and each variable found is passed to a
-caller-provided callback function. The callback function is responsible
-for any actions to be taken on the config option, and is free to ignore
-some options. It is not uncommon for the configuration to be parsed
-several times during the run of a Git program, with different callbacks
-picking out different variables useful to themselves.
-
-A config callback function takes three parameters:
-
-- the name of the parsed variable. This is in canonical "flat" form: the
- section, subsection, and variable segments will be separated by dots,
- and the section and variable segments will be all lowercase. E.g.,
- `core.ignorecase`, `diff.SomeType.textconv`.
-
-- the value of the found variable, as a string. If the variable had no
- value specified, the value will be NULL (typically this means it
- should be interpreted as boolean true).
-
-- a void pointer passed in by the caller of the config API; this can
- contain callback-specific data
-
-A config callback should return 0 for success, or -1 if the variable
-could not be parsed properly.
-
-Basic Config Querying
----------------------
-
-Most programs will simply want to look up variables in all config files
-that Git knows about, using the normal precedence rules. To do this,
-call `git_config` with a callback function and void data pointer.
-
-`git_config` will read all config sources in order of increasing
-priority. Thus a callback should typically overwrite previously-seen
-entries with new ones (e.g., if both the user-wide `~/.gitconfig` and
-repo-specific `.git/config` contain `color.ui`, the config machinery
-will first feed the user-wide one to the callback, and then the
-repo-specific one; by overwriting, the higher-priority repo-specific
-value is left at the end).
-
-The `config_with_options` function lets the caller examine config
-while adjusting some of the default behavior of `git_config`. It should
-almost never be used by "regular" Git code that is looking up
-configuration variables. It is intended for advanced callers like
-`git-config`, which are intentionally tweaking the normal config-lookup
-process. It takes two extra parameters:
-
-`config_source`::
-If this parameter is non-NULL, it specifies the source to parse for
-configuration, rather than looking in the usual files. See `struct
-git_config_source` in `config.h` for details. Regular `git_config` defaults
-to `NULL`.
-
-`opts`::
-Specify options to adjust the behavior of parsing config files. See `struct
-config_options` in `config.h` for details. As an example: regular `git_config`
-sets `opts.respect_includes` to `1` by default.
-
-Reading Specific Files
-----------------------
-
-To read a specific file in git-config format, use
-`git_config_from_file`. This takes the same callback and data parameters
-as `git_config`.
-
-Querying For Specific Variables
--------------------------------
-
-For programs wanting to query for specific variables in a non-callback
-manner, the config API provides two functions `git_config_get_value`
-and `git_config_get_value_multi`. They both read values from an internal
-cache generated previously from reading the config files.
-
-`int git_config_get_value(const char *key, const char **value)`::
-
- Finds the highest-priority value for the configuration variable `key`,
- stores the pointer to it in `value` and returns 0. When the
- configuration variable `key` is not found, returns 1 without touching
- `value`. The caller should not free or modify `value`, as it is owned
- by the cache.
-
-`const struct string_list *git_config_get_value_multi(const char *key)`::
-
- Finds and returns the value list, sorted in order of increasing priority
- for the configuration variable `key`. When the configuration variable
- `key` is not found, returns NULL. The caller should not free or modify
- the returned pointer, as it is owned by the cache.
-
-`void git_config_clear(void)`::
-
- Resets and invalidates the config cache.
-
-The config API also provides type specific API functions which do conversion
-as well as retrieval for the queried variable, including:
-
-`int git_config_get_int(const char *key, int *dest)`::
-
- Finds and parses the value to an integer for the configuration variable
- `key`. Dies on error; otherwise, stores the value of the parsed integer in
- `dest` and returns 0. When the configuration variable `key` is not found,
- returns 1 without touching `dest`.
-
-`int git_config_get_ulong(const char *key, unsigned long *dest)`::
-
- Similar to `git_config_get_int` but for unsigned longs.
-
-`int git_config_get_bool(const char *key, int *dest)`::
-
- Finds and parses the value into a boolean value, for the configuration
- variable `key` respecting keywords like "true" and "false". Integer
- values are converted into true/false values (when they are non-zero or
- zero, respectively). Other values cause a die(). If parsing is successful,
- stores the value of the parsed result in `dest` and returns 0. When the
- configuration variable `key` is not found, returns 1 without touching
- `dest`.
-
-`int git_config_get_bool_or_int(const char *key, int *is_bool, int *dest)`::
-
- Similar to `git_config_get_bool`, except that integers are copied as-is,
- and `is_bool` flag is unset.
-
-`int git_config_get_maybe_bool(const char *key, int *dest)`::
-
- Similar to `git_config_get_bool`, except that it returns -1 on error
- rather than dying.
-
-`int git_config_get_string_const(const char *key, const char **dest)`::
-
- Allocates and copies the retrieved string into the `dest` parameter for
- the configuration variable `key`; if NULL string is given, prints an
- error message and returns -1. When the configuration variable `key` is
- not found, returns 1 without touching `dest`.
-
-`int git_config_get_string(const char *key, char **dest)`::
-
- Similar to `git_config_get_string_const`, except that retrieved value
- copied into the `dest` parameter is a mutable string.
-
-`int git_config_get_pathname(const char *key, const char **dest)`::
-
- Similar to `git_config_get_string`, but expands `~` or `~user` into
- the user's home directory when found at the beginning of the path.
-
-`git_die_config(const char *key, const char *err, ...)`::
-
- First prints the error message specified by the caller in `err` and then
- dies printing the line number and the file name of the highest priority
- value for the configuration variable `key`.
-
-`void git_die_config_linenr(const char *key, const char *filename, int linenr)`::
-
- Helper function which formats the die error message according to the
- parameters entered. Used by `git_die_config()`. It can be used by callers
- handling `git_config_get_value_multi()` to print the correct error message
- for the desired value.
-
-See test-config.c for usage examples.
-
-Value Parsing Helpers
----------------------
-
-To aid in parsing string values, the config API provides callbacks with
-a number of helper functions, including:
-
-`git_config_int`::
-Parse the string to an integer, including unit factors. Dies on error;
-otherwise, returns the parsed result.
-
-`git_config_ulong`::
-Identical to `git_config_int`, but for unsigned longs.
-
-`git_config_bool`::
-Parse a string into a boolean value, respecting keywords like "true" and
-"false". Integer values are converted into true/false values (when they
-are non-zero or zero, respectively). Other values cause a die(). If
-parsing is successful, the return value is the result.
-
-`git_config_bool_or_int`::
-Same as `git_config_bool`, except that integers are returned as-is, and
-an `is_bool` flag is unset.
-
-`git_parse_maybe_bool`::
-Same as `git_config_bool`, except that it returns -1 on error rather
-than dying.
-
-`git_config_string`::
-Allocates and copies the value string into the `dest` parameter; if no
-string is given, prints an error message and returns -1.
-
-`git_config_pathname`::
-Similar to `git_config_string`, but expands `~` or `~user` into the
-user's home directory when found at the beginning of the path.
-
-Include Directives
-------------------
-
-By default, the config parser does not respect include directives.
-However, a caller can use the special `git_config_include` wrapper
-callback to support them. To do so, you simply wrap your "real" callback
-function and data pointer in a `struct config_include_data`, and pass
-the wrapper to the regular config-reading functions. For example:
-
--------------------------------------------
-int read_file_with_include(const char *file, config_fn_t fn, void *data)
-{
- struct config_include_data inc = CONFIG_INCLUDE_INIT;
- inc.fn = fn;
- inc.data = data;
- return git_config_from_file(git_config_include, file, &inc);
-}
--------------------------------------------
-
-`git_config` respects includes automatically. The lower-level
-`git_config_from_file` does not.
-
-Custom Configsets
------------------
-
-A `config_set` can be used to construct an in-memory cache for
-config-like files that the caller specifies (i.e., files like `.gitmodules`,
-`~/.gitconfig` etc.). For example,
-
-----------------------------------------
-struct config_set gm_config;
-git_configset_init(&gm_config);
-int b;
-/* we add config files to the config_set */
-git_configset_add_file(&gm_config, ".gitmodules");
-git_configset_add_file(&gm_config, ".gitmodules_alt");
-
-if (!git_configset_get_bool(gm_config, "submodule.frotz.ignore", &b)) {
- /* hack hack hack */
-}
-
-/* when we are done with the configset */
-git_configset_clear(&gm_config);
-----------------------------------------
-
-Configset API provides functions for the above mentioned work flow, including:
-
-`void git_configset_init(struct config_set *cs)`::
-
- Initializes the config_set `cs`.
-
-`int git_configset_add_file(struct config_set *cs, const char *filename)`::
-
- Parses the file and adds the variable-value pairs to the `config_set`,
- dies if there is an error in parsing the file. Returns 0 on success, or
- -1 if the file does not exist or is inaccessible. The user has to decide
- if he wants to free the incomplete configset or continue using it when
- the function returns -1.
-
-`int git_configset_get_value(struct config_set *cs, const char *key, const char **value)`::
-
- Finds the highest-priority value for the configuration variable `key`
- and config set `cs`, stores the pointer to it in `value` and returns 0.
- When the configuration variable `key` is not found, returns 1 without
- touching `value`. The caller should not free or modify `value`, as it
- is owned by the cache.
-
-`const struct string_list *git_configset_get_value_multi(struct config_set *cs, const char *key)`::
-
- Finds and returns the value list, sorted in order of increasing priority
- for the configuration variable `key` and config set `cs`. When the
- configuration variable `key` is not found, returns NULL. The caller
- should not free or modify the returned pointer, as it is owned by the cache.
-
-`void git_configset_clear(struct config_set *cs)`::
-
- Clears `config_set` structure, removes all saved variable-value pairs.
-
-In addition to above functions, the `config_set` API provides type specific
-functions in the vein of `git_config_get_int` and family but with an extra
-parameter, pointer to struct `config_set`.
-They all behave similarly to the `git_config_get*()` family described in
-"Querying For Specific Variables" above.
-
-Writing Config Files
---------------------
-
-Git gives multiple entry points in the Config API to write config values to
-files namely `git_config_set_in_file` and `git_config_set`, which write to
-a specific config file or to `.git/config` respectively. They both take a
-key/value pair as parameter.
-In the end they both call `git_config_set_multivar_in_file` which takes four
-parameters:
-
-- the name of the file, as a string, to which key/value pairs will be written.
-
-- the name of key, as a string. This is in canonical "flat" form: the section,
- subsection, and variable segments will be separated by dots, and the section
- and variable segments will be all lowercase.
- E.g., `core.ignorecase`, `diff.SomeType.textconv`.
-
-- the value of the variable, as a string. If value is equal to NULL, it will
- remove the matching key from the config file.
-
-- the value regex, as a string. It will disregard key/value pairs where value
- does not match.
-
-- a multi_replace value, as an int. If value is equal to zero, nothing or only
- one matching key/value is replaced, else all matching key/values (regardless
- how many) are removed, before the new pair is written.
-
-It returns 0 on success.
-
-Also, there are functions `git_config_rename_section` and
-`git_config_rename_section_in_file` with parameters `old_name` and `new_name`
-for renaming or removing sections in the config files. If NULL is passed
-through `new_name` parameter, the section will be removed from the config file.
diff --git a/Documentation/technical/api-credentials.txt b/Documentation/technical/api-credentials.txt
deleted file mode 100644
index 75368f2..0000000
--- a/Documentation/technical/api-credentials.txt
+++ /dev/null
@@ -1,271 +0,0 @@
-credentials API
-===============
-
-The credentials API provides an abstracted way of gathering username and
-password credentials from the user (even though credentials in the wider
-world can take many forms, in this document the word "credential" always
-refers to a username and password pair).
-
-This document describes two interfaces: the C API that the credential
-subsystem provides to the rest of Git, and the protocol that Git uses to
-communicate with system-specific "credential helpers". If you are
-writing Git code that wants to look up or prompt for credentials, see
-the section "C API" below. If you want to write your own helper, see
-the section on "Credential Helpers" below.
-
-Typical setup
--------------
-
-------------
-+-----------------------+
-| Git code (C) |--- to server requiring --->
-| | authentication
-|.......................|
-| C credential API |--- prompt ---> User
-+-----------------------+
- ^ |
- | pipe |
- | v
-+-----------------------+
-| Git credential helper |
-+-----------------------+
-------------
-
-The Git code (typically a remote-helper) will call the C API to obtain
-credential data like a login/password pair (credential_fill). The
-API will itself call a remote helper (e.g. "git credential-cache" or
-"git credential-store") that may retrieve credential data from a
-store. If the credential helper cannot find the information, the C API
-will prompt the user. Then, the caller of the API takes care of
-contacting the server, and does the actual authentication.
-
-C API
------
-
-The credential C API is meant to be called by Git code which needs to
-acquire or store a credential. It is centered around an object
-representing a single credential and provides three basic operations:
-fill (acquire credentials by calling helpers and/or prompting the user),
-approve (mark a credential as successfully used so that it can be stored
-for later use), and reject (mark a credential as unsuccessful so that it
-can be erased from any persistent storage).
-
-Data Structures
-~~~~~~~~~~~~~~~
-
-`struct credential`::
-
- This struct represents a single username/password combination
- along with any associated context. All string fields should be
- heap-allocated (or NULL if they are not known or not applicable).
- The meaning of the individual context fields is the same as
- their counterparts in the helper protocol; see the section below
- for a description of each field.
-+
-The `helpers` member of the struct is a `string_list` of helpers. Each
-string specifies an external helper which will be run, in order, to
-either acquire or store credentials. See the section on credential
-helpers below. This list is filled-in by the API functions
-according to the corresponding configuration variables before
-consulting helpers, so there usually is no need for a caller to
-modify the helpers field at all.
-+
-This struct should always be initialized with `CREDENTIAL_INIT` or
-`credential_init`.
-
-
-Functions
-~~~~~~~~~
-
-`credential_init`::
-
- Initialize a credential structure, setting all fields to empty.
-
-`credential_clear`::
-
- Free any resources associated with the credential structure,
- returning it to a pristine initialized state.
-
-`credential_fill`::
-
- Instruct the credential subsystem to fill the username and
- password fields of the passed credential struct by first
- consulting helpers, then asking the user. After this function
- returns, the username and password fields of the credential are
- guaranteed to be non-NULL. If an error occurs, the function will
- die().
-
-`credential_reject`::
-
- Inform the credential subsystem that the provided credentials
- have been rejected. This will cause the credential subsystem to
- notify any helpers of the rejection (which allows them, for
- example, to purge the invalid credentials from storage). It
- will also free() the username and password fields of the
- credential and set them to NULL (readying the credential for
- another call to `credential_fill`). Any errors from helpers are
- ignored.
-
-`credential_approve`::
-
- Inform the credential subsystem that the provided credentials
- were successfully used for authentication. This will cause the
- credential subsystem to notify any helpers of the approval, so
- that they may store the result to be used again. Any errors
- from helpers are ignored.
-
-`credential_from_url`::
-
- Parse a URL into broken-down credential fields.
-
-Example
-~~~~~~~
-
-The example below shows how the functions of the credential API could be
-used to login to a fictitious "foo" service on a remote host:
-
------------------------------------------------------------------------
-int foo_login(struct foo_connection *f)
-{
- int status;
- /*
- * Create a credential with some context; we don't yet know the
- * username or password.
- */
-
- struct credential c = CREDENTIAL_INIT;
- c.protocol = xstrdup("foo");
- c.host = xstrdup(f->hostname);
-
- /*
- * Fill in the username and password fields by contacting
- * helpers and/or asking the user. The function will die if it
- * fails.
- */
- credential_fill(&c);
-
- /*
- * Otherwise, we have a username and password. Try to use it.
- */
- status = send_foo_login(f, c.username, c.password);
- switch (status) {
- case FOO_OK:
- /* It worked. Store the credential for later use. */
- credential_accept(&c);
- break;
- case FOO_BAD_LOGIN:
- /* Erase the credential from storage so we don't try it
- * again. */
- credential_reject(&c);
- break;
- default:
- /*
- * Some other error occurred. We don't know if the
- * credential is good or bad, so report nothing to the
- * credential subsystem.
- */
- }
-
- /* Free any associated resources. */
- credential_clear(&c);
-
- return status;
-}
------------------------------------------------------------------------
-
-
-Credential Helpers
-------------------
-
-Credential helpers are programs executed by Git to fetch or save
-credentials from and to long-term storage (where "long-term" is simply
-longer than a single Git process; e.g., credentials may be stored
-in-memory for a few minutes, or indefinitely on disk).
-
-Each helper is specified by a single string in the configuration
-variable `credential.helper` (and others, see linkgit:git-config[1]).
-The string is transformed by Git into a command to be executed using
-these rules:
-
- 1. If the helper string begins with "!", it is considered a shell
- snippet, and everything after the "!" becomes the command.
-
- 2. Otherwise, if the helper string begins with an absolute path, the
- verbatim helper string becomes the command.
-
- 3. Otherwise, the string "git credential-" is prepended to the helper
- string, and the result becomes the command.
-
-The resulting command then has an "operation" argument appended to it
-(see below for details), and the result is executed by the shell.
-
-Here are some example specifications:
-
-----------------------------------------------------
-# run "git credential-foo"
-foo
-
-# same as above, but pass an argument to the helper
-foo --bar=baz
-
-# the arguments are parsed by the shell, so use shell
-# quoting if necessary
-foo --bar="whitespace arg"
-
-# you can also use an absolute path, which will not use the git wrapper
-/path/to/my/helper --with-arguments
-
-# or you can specify your own shell snippet
-!f() { echo "password=`cat $HOME/.secret`"; }; f
-----------------------------------------------------
-
-Generally speaking, rule (3) above is the simplest for users to specify.
-Authors of credential helpers should make an effort to assist their
-users by naming their program "git-credential-$NAME", and putting it in
-the $PATH or $GIT_EXEC_PATH during installation, which will allow a user
-to enable it with `git config credential.helper $NAME`.
-
-When a helper is executed, it will have one "operation" argument
-appended to its command line, which is one of:
-
-`get`::
-
- Return a matching credential, if any exists.
-
-`store`::
-
- Store the credential, if applicable to the helper.
-
-`erase`::
-
- Remove a matching credential, if any, from the helper's storage.
-
-The details of the credential will be provided on the helper's stdin
-stream. The exact format is the same as the input/output format of the
-`git credential` plumbing command (see the section `INPUT/OUTPUT
-FORMAT` in linkgit:git-credential[1] for a detailed specification).
-
-For a `get` operation, the helper should produce a list of attributes
-on stdout in the same format. A helper is free to produce a subset, or
-even no values at all if it has nothing useful to provide. Any provided
-attributes will overwrite those already known about by Git. If a helper
-outputs a `quit` attribute with a value of `true` or `1`, no further
-helpers will be consulted, nor will the user be prompted (if no
-credential has been provided, the operation will then fail).
-
-For a `store` or `erase` operation, the helper's output is ignored.
-If it fails to perform the requested operation, it may complain to
-stderr to inform the user. If it does not support the requested
-operation (e.g., a read-only store), it should silently ignore the
-request.
-
-If a helper receives any other operation, it should silently ignore the
-request. This leaves room for future operations to be added (older
-helpers will just ignore the new requests).
-
-See also
---------
-
-linkgit:gitcredentials[7]
-
-linkgit:git-config[1] (See configuration variables `credential.*`)
diff --git a/Documentation/technical/api-diff.txt b/Documentation/technical/api-diff.txt
deleted file mode 100644
index 30fc0e9..0000000
--- a/Documentation/technical/api-diff.txt
+++ /dev/null
@@ -1,174 +0,0 @@
-diff API
-========
-
-The diff API is for programs that compare two sets of files (e.g. two
-trees, one tree and the index) and present the found difference in
-various ways. The calling program is responsible for feeding the API
-pairs of files, one from the "old" set and the corresponding one from
-"new" set, that are different. The library called through this API is
-called diffcore, and is responsible for two things.
-
-* finding total rewrites (`-B`), renames (`-M`) and copies (`-C`), and
- changes that touch a string (`-S`), as specified by the caller.
-
-* outputting the differences in various formats, as specified by the
- caller.
-
-Calling sequence
-----------------
-
-* Prepare `struct diff_options` to record the set of diff options, and
- then call `repo_diff_setup()` to initialize this structure. This
- sets up the vanilla default.
-
-* Fill in the options structure to specify desired output format, rename
- detection, etc. `diff_opt_parse()` can be used to parse options given
- from the command line in a way consistent with existing git-diff
- family of programs.
-
-* Call `diff_setup_done()`; this inspects the options set up so far for
- internal consistency and make necessary tweaking to it (e.g. if
- textual patch output was asked, recursive behaviour is turned on);
- the callback set_default in diff_options can be used to tweak this more.
-
-* As you find different pairs of files, call `diff_change()` to feed
- modified files, `diff_addremove()` to feed created or deleted files,
- or `diff_unmerge()` to feed a file whose state is 'unmerged' to the
- API. These are thin wrappers to a lower-level `diff_queue()` function
- that is flexible enough to record any of these kinds of changes.
-
-* Once you finish feeding the pairs of files, call `diffcore_std()`.
- This will tell the diffcore library to go ahead and do its work.
-
-* Calling `diff_flush()` will produce the output.
-
-
-Data structures
----------------
-
-* `struct diff_filespec`
-
-This is the internal representation for a single file (blob). It
-records the blob object name (if known -- for a work tree file it
-typically is a NUL SHA-1), filemode and pathname. This is what the
-`diff_addremove()`, `diff_change()` and `diff_unmerge()` synthesize and
-feed `diff_queue()` function with.
-
-* `struct diff_filepair`
-
-This records a pair of `struct diff_filespec`; the filespec for a file
-in the "old" set (i.e. preimage) is called `one`, and the filespec for a
-file in the "new" set (i.e. postimage) is called `two`. A change that
-represents file creation has NULL in `one`, and file deletion has NULL
-in `two`.
-
-A `filepair` starts pointing at `one` and `two` that are from the same
-filename, but `diffcore_std()` can break pairs and match component
-filespecs with other filespecs from a different filepair to form new
-filepair. This is called 'rename detection'.
-
-* `struct diff_queue`
-
-This is a collection of filepairs. Notable members are:
-
-`queue`::
-
- An array of pointers to `struct diff_filepair`. This
- dynamically grows as you add filepairs;
-
-`alloc`::
-
- The allocated size of the `queue` array;
-
-`nr`::
-
- The number of elements in the `queue` array.
-
-
-* `struct diff_options`
-
-This describes the set of options the calling program wants to affect
-the operation of diffcore library with.
-
-Notable members are:
-
-`output_format`::
- The output format used when `diff_flush()` is run.
-
-`context`::
- Number of context lines to generate in patch output.
-
-`break_opt`, `detect_rename`, `rename-score`, `rename_limit`::
- Affects the way detection logic for complete rewrites, renames
- and copies.
-
-`abbrev`::
- Number of hexdigits to abbreviate raw format output to.
-
-`pickaxe`::
- A constant string (can and typically does contain newlines to
- look for a block of text, not just a single line) to filter out
- the filepairs that do not change the number of strings contained
- in its preimage and postimage of the diff_queue.
-
-`flags`::
- This is mostly a collection of boolean options that affects the
- operation, but some do not have anything to do with the diffcore
- library.
-
-`touched_flags`::
- Records whether a flag has been changed due to user request
- (rather than just set/unset by default).
-
-`set_default`::
- Callback which allows tweaking the options in diff_setup_done().
-
-BINARY, TEXT;;
- Affects the way how a file that is seemingly binary is treated.
-
-FULL_INDEX;;
- Tells the patch output format not to use abbreviated object
- names on the "index" lines.
-
-FIND_COPIES_HARDER;;
- Tells the diffcore library that the caller is feeding unchanged
- filepairs to allow copies from unmodified files be detected.
-
-COLOR_DIFF;;
- Output should be colored.
-
-COLOR_DIFF_WORDS;;
- Output is a colored word-diff.
-
-NO_INDEX;;
- Tells diff-files that the input is not tracked files but files
- in random locations on the filesystem.
-
-ALLOW_EXTERNAL;;
- Tells output routine that it is Ok to call user specified patch
- output routine. Plumbing disables this to ensure stable output.
-
-QUIET;;
- Do not show any output.
-
-REVERSE_DIFF;;
- Tells the library that the calling program is feeding the
- filepairs reversed; `one` is two, and `two` is one.
-
-EXIT_WITH_STATUS;;
- For communication between the calling program and the options
- parser; tell the calling program to signal the presence of
- difference using program exit code.
-
-HAS_CHANGES;;
- Internal; used for optimization to see if there is any change.
-
-SILENT_ON_REMOVE;;
- Affects if diff-files shows removed files.
-
-RECURSIVE, TREE_IN_RECURSIVE;;
- Tells if tree traversal done by tree-diff should recursively
- descend into a tree object pair that are different in preimage
- and postimage set.
-
-(JC)
diff --git a/Documentation/technical/api-directory-listing.txt b/Documentation/technical/api-directory-listing.txt
deleted file mode 100644
index 76b6e4f..0000000
--- a/Documentation/technical/api-directory-listing.txt
+++ /dev/null
@@ -1,130 +0,0 @@
-directory listing API
-=====================
-
-The directory listing API is used to enumerate paths in the work tree,
-optionally taking `.git/info/exclude` and `.gitignore` files per
-directory into account.
-
-Data structure
---------------
-
-`struct dir_struct` structure is used to pass directory traversal
-options to the library and to record the paths discovered. A single
-`struct dir_struct` is used regardless of whether or not the traversal
-recursively descends into subdirectories.
-
-The notable options are:
-
-`exclude_per_dir`::
-
- The name of the file to be read in each directory for excluded
- files (typically `.gitignore`).
-
-`flags`::
-
- A bit-field of options:
-
-`DIR_SHOW_IGNORED`:::
-
- Return just ignored files in `entries[]`, not untracked
- files. This flag is mutually exclusive with
- `DIR_SHOW_IGNORED_TOO`.
-
-`DIR_SHOW_IGNORED_TOO`:::
-
- Similar to `DIR_SHOW_IGNORED`, but return ignored files in
- `ignored[]` in addition to untracked files in
- `entries[]`. This flag is mutually exclusive with
- `DIR_SHOW_IGNORED`.
-
-`DIR_KEEP_UNTRACKED_CONTENTS`:::
-
- Only has meaning if `DIR_SHOW_IGNORED_TOO` is also set; if this is set, the
- untracked contents of untracked directories are also returned in
- `entries[]`.
-
-`DIR_SHOW_IGNORED_TOO_MODE_MATCHING`:::
-
- Only has meaning if `DIR_SHOW_IGNORED_TOO` is also set; if
- this is set, returns ignored files and directories that match
- an exclude pattern. If a directory matches an exclude pattern,
- then the directory is returned and the contained paths are
- not. A directory that does not match an exclude pattern will
- not be returned even if all of its contents are ignored. In
- this case, the contents are returned as individual entries.
-+
-If this is set, files and directories that explicitly match an ignore
-pattern are reported. Implicitly ignored directories (directories that
-do not match an ignore pattern, but whose contents are all ignored)
-are not reported, instead all of the contents are reported.
-
-`DIR_COLLECT_IGNORED`:::
-
- Special mode for git-add. Return ignored files in `ignored[]` and
- untracked files in `entries[]`. Only returns ignored files that match
- pathspec exactly (no wildcards). Does not recurse into ignored
- directories.
-
-`DIR_SHOW_OTHER_DIRECTORIES`:::
-
- Include a directory that is not tracked.
-
-`DIR_HIDE_EMPTY_DIRECTORIES`:::
-
- Do not include a directory that is not tracked and is empty.
-
-`DIR_NO_GITLINKS`:::
-
- If set, recurse into a directory that looks like a Git
- directory. Otherwise it is shown as a directory.
-
-The result of the enumeration is left in these fields:
-
-`entries[]`::
-
- An array of `struct dir_entry`, each element of which describes
- a path.
-
-`nr`::
-
- The number of members in `entries[]` array.
-
-`alloc`::
-
- Internal use; keeps track of allocation of `entries[]` array.
-
-`ignored[]`::
-
- An array of `struct dir_entry`, used for ignored paths with the
- `DIR_SHOW_IGNORED_TOO` and `DIR_COLLECT_IGNORED` flags.
-
-`ignored_nr`::
-
- The number of members in `ignored[]` array.
-
-Calling sequence
-----------------
-
-Note: index may be looked at for .gitignore files that are CE_SKIP_WORKTREE
-marked. If you to exclude files, make sure you have loaded index first.
-
-* Prepare `struct dir_struct dir` and clear it with `memset(&dir, 0,
- sizeof(dir))`.
-
-* To add single exclude pattern, call `add_pattern_list()` and then
- `add_pattern()`.
-
-* To add patterns from a file (e.g. `.git/info/exclude`), call
- `add_patterns_from_file()` , and/or set `dir.exclude_per_dir`. A
- short-hand function `setup_standard_excludes()` can be used to set
- up the standard set of exclude settings.
-
-* Set options described in the Data Structure section above.
-
-* Call `read_directory()`.
-
-* Use `dir.entries[]`.
-
-* Call `clear_directory()` when none of the contained elements are no longer in use.
-
-(JC)
diff --git a/Documentation/technical/api-error-handling.txt b/Documentation/technical/api-error-handling.txt
index ceeedd4..665c496 100644
--- a/Documentation/technical/api-error-handling.txt
+++ b/Documentation/technical/api-error-handling.txt
@@ -1,8 +1,33 @@
Error reporting in git
======================
-`die`, `usage`, `error`, and `warning` report errors of various
-kinds.
+`BUG`, `bug`, `die`, `usage`, `error`, and `warning` report errors of
+various kinds.
+
+- `BUG` is for failed internal assertions that should never happen,
+ i.e. a bug in git itself.
+
+- `bug` (lower-case, not `BUG`) is supposed to be used like `BUG` but
+ prints a "BUG" message instead of calling `abort()`.
++
+A call to `bug()` will then result in a "real" call to the `BUG()`
+function, either explicitly by invoking `BUG_if_bug()` after call(s)
+to `bug()`, or implicitly at `exit()` time where we'll check if we
+encountered any outstanding `bug()` invocations.
++
+If there were no prior calls to `bug()` before invoking `BUG_if_bug()`
+the latter is a NOOP. The `BUG_if_bug()` function takes the same
+arguments as `BUG()` itself. Calling `BUG_if_bug()` explicitly isn't
+necessary, but ensures that we die as soon as possible.
++
+If you know you had prior calls to `bug()` then calling `BUG()` itself
+is equivalent to calling `BUG_if_bug()`, the latter being a wrapper
+calling `BUG()` if we've set a flag indicating that we've called
+`bug()`.
++
+This is for the convenience of APIs who'd like to potentially report
+more than one "bug", such as the optbug() validation in
+parse-options.c.
- `die` is for fatal application errors. It prints a message to
the user and exits with status 128.
@@ -20,6 +45,9 @@ kinds.
without running into too many problems. Like `error`, it
returns -1 after reporting the situation to the caller.
+These reports will be logged via the trace2 facility. See the "error"
+event in link:api-trace2.html[trace2 API].
+
Customizable error handlers
---------------------------
diff --git a/Documentation/technical/api-gitattributes.txt b/Documentation/technical/api-gitattributes.txt
deleted file mode 100644
index 45f0df6..0000000
--- a/Documentation/technical/api-gitattributes.txt
+++ /dev/null
@@ -1,154 +0,0 @@
-gitattributes API
-=================
-
-gitattributes mechanism gives a uniform way to associate various
-attributes to set of paths.
-
-
-Data Structure
---------------
-
-`struct git_attr`::
-
- An attribute is an opaque object that is identified by its name.
- Pass the name to `git_attr()` function to obtain the object of
- this type. The internal representation of this structure is
- of no interest to the calling programs. The name of the
- attribute can be retrieved by calling `git_attr_name()`.
-
-`struct attr_check_item`::
-
- This structure represents one attribute and its value.
-
-`struct attr_check`::
-
- This structure represents a collection of `attr_check_item`.
- It is passed to `git_check_attr()` function, specifying the
- attributes to check, and receives their values.
-
-
-Attribute Values
-----------------
-
-An attribute for a path can be in one of four states: Set, Unset,
-Unspecified or set to a string, and `.value` member of `struct
-attr_check_item` records it. There are three macros to check these:
-
-`ATTR_TRUE()`::
-
- Returns true if the attribute is Set for the path.
-
-`ATTR_FALSE()`::
-
- Returns true if the attribute is Unset for the path.
-
-`ATTR_UNSET()`::
-
- Returns true if the attribute is Unspecified for the path.
-
-If none of the above returns true, `.value` member points at a string
-value of the attribute for the path.
-
-
-Querying Specific Attributes
-----------------------------
-
-* Prepare `struct attr_check` using attr_check_initl()
- function, enumerating the names of attributes whose values you are
- interested in, terminated with a NULL pointer. Alternatively, an
- empty `struct attr_check` can be prepared by calling
- `attr_check_alloc()` function and then attributes you want to
- ask about can be added to it with `attr_check_append()`
- function.
-
-* Call `git_check_attr()` to check the attributes for the path.
-
-* Inspect `attr_check` structure to see how each of the
- attribute in the array is defined for the path.
-
-
-Example
--------
-
-To see how attributes "crlf" and "ident" are set for different paths.
-
-. Prepare a `struct attr_check` with two elements (because
- we are checking two attributes):
-
-------------
-static struct attr_check *check;
-static void setup_check(void)
-{
- if (check)
- return; /* already done */
- check = attr_check_initl("crlf", "ident", NULL);
-}
-------------
-
-. Call `git_check_attr()` with the prepared `struct attr_check`:
-
-------------
- const char *path;
-
- setup_check();
- git_check_attr(path, check);
-------------
-
-. Act on `.value` member of the result, left in `check->items[]`:
-
-------------
- const char *value = check->items[0].value;
-
- if (ATTR_TRUE(value)) {
- The attribute is Set, by listing only the name of the
- attribute in the gitattributes file for the path.
- } else if (ATTR_FALSE(value)) {
- The attribute is Unset, by listing the name of the
- attribute prefixed with a dash - for the path.
- } else if (ATTR_UNSET(value)) {
- The attribute is neither set nor unset for the path.
- } else if (!strcmp(value, "input")) {
- If none of ATTR_TRUE(), ATTR_FALSE(), or ATTR_UNSET() is
- true, the value is a string set in the gitattributes
- file for the path by saying "attr=value".
- } else if (... other check using value as string ...) {
- ...
- }
-------------
-
-To see how attributes in argv[] are set for different paths, only
-the first step in the above would be different.
-
-------------
-static struct attr_check *check;
-static void setup_check(const char **argv)
-{
- check = attr_check_alloc();
- while (*argv) {
- struct git_attr *attr = git_attr(*argv);
- attr_check_append(check, attr);
- argv++;
- }
-}
-------------
-
-
-Querying All Attributes
------------------------
-
-To get the values of all attributes associated with a file:
-
-* Prepare an empty `attr_check` structure by calling
- `attr_check_alloc()`.
-
-* Call `git_all_attrs()`, which populates the `attr_check`
- with the attributes attached to the path.
-
-* Iterate over the `attr_check.items[]` array to examine
- the attribute names and values. The name of the attribute
- described by an `attr_check.items[]` object can be retrieved via
- `git_attr_name(check->items[i].attr)`. (Please note that no items
- will be returned for unset attributes, so `ATTR_UNSET()` will return
- false for all returned `attr_check.items[]` objects.)
-
-* Free the `attr_check` struct by calling `attr_check_free()`.
diff --git a/Documentation/technical/api-grep.txt b/Documentation/technical/api-grep.txt
deleted file mode 100644
index a69cc89..0000000
--- a/Documentation/technical/api-grep.txt
+++ /dev/null
@@ -1,8 +0,0 @@
-grep API
-========
-
-Talk about <grep.h>, things like:
-
-* grep_buffer()
-
-(JC)
diff --git a/Documentation/technical/api-history-graph.txt b/Documentation/technical/api-history-graph.txt
deleted file mode 100644
index d0d1707..0000000
--- a/Documentation/technical/api-history-graph.txt
+++ /dev/null
@@ -1,173 +0,0 @@
-history graph API
-=================
-
-The graph API is used to draw a text-based representation of the commit
-history. The API generates the graph in a line-by-line fashion.
-
-Functions
----------
-
-Core functions:
-
-* `graph_init()` creates a new `struct git_graph`
-
-* `graph_update()` moves the graph to a new commit.
-
-* `graph_next_line()` outputs the next line of the graph into a strbuf. It
- does not add a terminating newline.
-
-* `graph_padding_line()` outputs a line of vertical padding in the graph. It
- is similar to `graph_next_line()`, but is guaranteed to never print the line
- containing the current commit. Where `graph_next_line()` would print the
- commit line next, `graph_padding_line()` prints a line that simply extends
- all branch lines downwards one row, leaving their positions unchanged.
-
-* `graph_is_commit_finished()` determines if the graph has output all lines
- necessary for the current commit. If `graph_update()` is called before all
- lines for the current commit have been printed, the next call to
- `graph_next_line()` will output an ellipsis, to indicate that a portion of
- the graph was omitted.
-
-The following utility functions are wrappers around `graph_next_line()` and
-`graph_is_commit_finished()`. They always print the output to stdout.
-They can all be called with a NULL graph argument, in which case no graph
-output will be printed.
-
-* `graph_show_commit()` calls `graph_next_line()` and
- `graph_is_commit_finished()` until one of them return non-zero. This prints
- all graph lines up to, and including, the line containing this commit.
- Output is printed to stdout. The last line printed does not contain a
- terminating newline.
-
-* `graph_show_oneline()` calls `graph_next_line()` and prints the result to
- stdout. The line printed does not contain a terminating newline.
-
-* `graph_show_padding()` calls `graph_padding_line()` and prints the result to
- stdout. The line printed does not contain a terminating newline.
-
-* `graph_show_remainder()` calls `graph_next_line()` until
- `graph_is_commit_finished()` returns non-zero. Output is printed to stdout.
- The last line printed does not contain a terminating newline. Returns 1 if
- output was printed, and 0 if no output was necessary.
-
-* `graph_show_strbuf()` prints the specified strbuf to stdout, prefixing all
- lines but the first with a graph line. The caller is responsible for
- ensuring graph output for the first line has already been printed to stdout.
- (This can be done with `graph_show_commit()` or `graph_show_oneline()`.) If
- a NULL graph is supplied, the strbuf is printed as-is.
-
-* `graph_show_commit_msg()` is similar to `graph_show_strbuf()`, but it also
- prints the remainder of the graph, if more lines are needed after the strbuf
- ends. It is better than directly calling `graph_show_strbuf()` followed by
- `graph_show_remainder()` since it properly handles buffers that do not end in
- a terminating newline. The output printed by `graph_show_commit_msg()` will
- end in a newline if and only if the strbuf ends in a newline.
-
-Data structure
---------------
-`struct git_graph` is an opaque data type used to store the current graph
-state.
-
-Calling sequence
-----------------
-
-* Create a `struct git_graph` by calling `graph_init()`. When using the
- revision walking API, this is done automatically by `setup_revisions()` if
- the '--graph' option is supplied.
-
-* Use the revision walking API to walk through a group of contiguous commits.
- The `get_revision()` function automatically calls `graph_update()` each time
- it is invoked.
-
-* For each commit, call `graph_next_line()` repeatedly, until
- `graph_is_commit_finished()` returns non-zero. Each call to
- `graph_next_line()` will output a single line of the graph. The resulting
- lines will not contain any newlines. `graph_next_line()` returns 1 if the
- resulting line contains the current commit, or 0 if this is merely a line
- needed to adjust the graph before or after the current commit. This return
- value can be used to determine where to print the commit summary information
- alongside the graph output.
-
-Limitations
------------
-
-* `graph_update()` must be called with commits in topological order. It should
- not be called on a commit if it has already been invoked with an ancestor of
- that commit, or the graph output will be incorrect.
-
-* `graph_update()` must be called on a contiguous group of commits. If
- `graph_update()` is called on a particular commit, it should later be called
- on all parents of that commit. Parents must not be skipped, or the graph
- output will appear incorrect.
-+
-`graph_update()` may be used on a pruned set of commits only if the parent list
-has been rewritten so as to include only ancestors from the pruned set.
-
-* The graph API does not currently support reverse commit ordering. In
- order to implement reverse ordering, the graphing API needs an
- (efficient) mechanism to find the children of a commit.
-
-Sample usage
-------------
-
-------------
-struct commit *commit;
-struct git_graph *graph = graph_init(opts);
-
-while ((commit = get_revision(opts)) != NULL) {
- while (!graph_is_commit_finished(graph))
- {
- struct strbuf sb;
- int is_commit_line;
-
- strbuf_init(&sb, 0);
- is_commit_line = graph_next_line(graph, &sb);
- fputs(sb.buf, stdout);
-
- if (is_commit_line)
- log_tree_commit(opts, commit);
- else
- putchar(opts->diffopt.line_termination);
- }
-}
-------------
-
-Sample output
--------------
-
-The following is an example of the output from the graph API. This output does
-not include any commit summary information--callers are responsible for
-outputting that information, if desired.
-
-------------
-*
-*
-*
-|\
-* |
-| | *
-| \ \
-| \ \
-*-. \ \
-|\ \ \ \
-| | * | |
-| | | | | *
-| | | | | *
-| | | | | *
-| | | | | |\
-| | | | | | *
-| * | | | | |
-| | | | | * \
-| | | | | |\ |
-| | | | * | | |
-| | | | * | | |
-* | | | | | | |
-| |/ / / / / /
-|/| / / / / /
-* | | | | | |
-|/ / / / / /
-* | | | | |
-| | | | | *
-| | | | |/
-| | | | *
-------------
diff --git a/Documentation/technical/api-index-skel.txt b/Documentation/technical/api-index-skel.txt
index eda8c19..7780a76 100644
--- a/Documentation/technical/api-index-skel.txt
+++ b/Documentation/technical/api-index-skel.txt
@@ -1,7 +1,7 @@
Git API Documents
=================
-Git has grown a set of internal API over time. This collection
+Git has grown a set of internal APIs over time. This collection
documents them.
////////////////////////////////////////////////////////////////
diff --git a/Documentation/technical/api-merge.txt b/Documentation/technical/api-merge.txt
index 9dc1bed..c2ba018 100644
--- a/Documentation/technical/api-merge.txt
+++ b/Documentation/technical/api-merge.txt
@@ -28,77 +28,9 @@ and `diff.c` for examples.
* `struct ll_merge_options`
-This describes the set of options the calling program wants to affect
-the operation of a low-level (single file) merge. Some options:
-
-`virtual_ancestor`::
- Behave as though this were part of a merge between common
- ancestors in a recursive merge.
- If a helper program is specified by the
- `[merge "<driver>"] recursive` configuration, it will
- be used (see linkgit:gitattributes[5]).
-
-`variant`::
- Resolve local conflicts automatically in favor
- of one side or the other (as in 'git merge-file'
- `--ours`/`--theirs`/`--union`). Can be `0`,
- `XDL_MERGE_FAVOR_OURS`, `XDL_MERGE_FAVOR_THEIRS`, or
- `XDL_MERGE_FAVOR_UNION`.
-
-`renormalize`::
- Resmudge and clean the "base", "theirs" and "ours" files
- before merging. Use this when the merge is likely to have
- overlapped with a change in smudge/clean or end-of-line
- normalization rules.
+Check merge-ll.h for details.
Low-level (single file) merge
-----------------------------
-`ll_merge`::
-
- Perform a three-way single-file merge in core. This is
- a thin wrapper around `xdl_merge` that takes the path and
- any merge backend specified in `.gitattributes` or
- `.git/info/attributes` into account. Returns 0 for a
- clean merge.
-
-Calling sequence:
-
-* Prepare a `struct ll_merge_options` to record options.
- If you have no special requests, skip this and pass `NULL`
- as the `opts` parameter to use the default options.
-
-* Allocate an mmbuffer_t variable for the result.
-
-* Allocate and fill variables with the file's original content
- and two modified versions (using `read_mmfile`, for example).
-
-* Call `ll_merge()`.
-
-* Read the merged content from `result_buf.ptr` and `result_buf.size`.
-
-* Release buffers when finished. A simple
- `free(ancestor.ptr); free(ours.ptr); free(theirs.ptr);
- free(result_buf.ptr);` will do.
-
-If the modifications do not merge cleanly, `ll_merge` will return a
-nonzero value and `result_buf` will generally include a description of
-the conflict bracketed by markers such as the traditional `<<<<<<<`
-and `>>>>>>>`.
-
-The `ancestor_label`, `our_label`, and `their_label` parameters are
-used to label the different sides of a conflict if the merge driver
-supports this.
-
-Everything else
----------------
-
-Talk about <merge-recursive.h> and merge_file():
-
- - merge_trees() to merge with rename detection
- - merge_recursive() for ancestor consolidation
- - try_merge_command() for other strategies
- - conflict format
- - merge options
-
-(Daniel, Miklos, Stephan, JC)
+Check merge-ll.h for details.
diff --git a/Documentation/technical/api-object-access.txt b/Documentation/technical/api-object-access.txt
deleted file mode 100644
index 5b29622..0000000
--- a/Documentation/technical/api-object-access.txt
+++ /dev/null
@@ -1,15 +0,0 @@
-object access API
-=================
-
-Talk about <sha1-file.c> and <object.h> family, things like
-
-* read_sha1_file()
-* read_object_with_reference()
-* has_sha1_file()
-* write_sha1_file()
-* pretend_object_file()
-* lookup_{object,commit,tag,blob,tree}
-* parse_{object,commit,tag,blob,tree}
-* Use of object flags
-
-(JC, Shawn, Daniel, Dscho, Linus)
diff --git a/Documentation/technical/api-oid-array.txt b/Documentation/technical/api-oid-array.txt
deleted file mode 100644
index c97428c..0000000
--- a/Documentation/technical/api-oid-array.txt
+++ /dev/null
@@ -1,90 +0,0 @@
-oid-array API
-==============
-
-The oid-array API provides storage and manipulation of sets of object
-identifiers. The emphasis is on storage and processing efficiency,
-making them suitable for large lists. Note that the ordering of items is
-not preserved over some operations.
-
-Data Structures
----------------
-
-`struct oid_array`::
-
- A single array of object IDs. This should be initialized by
- assignment from `OID_ARRAY_INIT`. The `oid` member contains
- the actual data. The `nr` member contains the number of items in
- the set. The `alloc` and `sorted` members are used internally,
- and should not be needed by API callers.
-
-Functions
----------
-
-`oid_array_append`::
- Add an item to the set. The object ID will be placed at the end of
- the array (but note that some operations below may lose this
- ordering).
-
-`oid_array_lookup`::
- Perform a binary search of the array for a specific object ID.
- If found, returns the offset (in number of elements) of the
- object ID. If not found, returns a negative integer. If the array
- is not sorted, this function has the side effect of sorting it.
-
-`oid_array_clear`::
- Free all memory associated with the array and return it to the
- initial, empty state.
-
-`oid_array_for_each`::
- Iterate over each element of the list, executing the callback
- function for each one. Does not sort the list, so any custom
- hash order is retained. If the callback returns a non-zero
- value, the iteration ends immediately and the callback's
- return is propagated; otherwise, 0 is returned.
-
-`oid_array_for_each_unique`::
- Iterate over each unique element of the list in sorted order,
- but otherwise behave like `oid_array_for_each`. If the array
- is not sorted, this function has the side effect of sorting
- it.
-
-`oid_array_filter`::
- Apply the callback function `want` to each entry in the array,
- retaining only the entries for which the function returns true.
- Preserve the order of the entries that are retained.
-
-Examples
---------
-
------------------------------------------
-int print_callback(const struct object_id *oid,
- void *data)
-{
- printf("%s\n", oid_to_hex(oid));
- return 0; /* always continue */
-}
-
-void some_func(void)
-{
- struct sha1_array hashes = OID_ARRAY_INIT;
- struct object_id oid;
-
- /* Read objects into our set */
- while (read_object_from_stdin(oid.hash))
- oid_array_append(&hashes, &oid);
-
- /* Check if some objects are in our set */
- while (read_object_from_stdin(oid.hash)) {
- if (oid_array_lookup(&hashes, &oid) >= 0)
- printf("it's in there!\n");
-
- /*
- * Print the unique set of objects. We could also have
- * avoided adding duplicate objects in the first place,
- * but we would end up re-sorting the array repeatedly.
- * Instead, this will sort once and then skip duplicates
- * in linear time.
- */
- oid_array_for_each_unique(&hashes, print_callback, NULL);
-}
------------------------------------------
diff --git a/Documentation/technical/api-parse-options.txt b/Documentation/technical/api-parse-options.txt
index 2e2e7c1..61fa6ee 100644
--- a/Documentation/technical/api-parse-options.txt
+++ b/Documentation/technical/api-parse-options.txt
@@ -8,7 +8,8 @@ Basics
------
The argument vector `argv[]` may usually contain mandatory or optional
-'non-option arguments', e.g. a filename or a branch, and 'options'.
+'non-option arguments', e.g. a filename or a branch, 'options', and
+'subcommands'.
Options are optional arguments that start with a dash and
that allow to change the behavior of a command.
@@ -48,6 +49,33 @@ The parse-options API allows:
option, e.g. `-a -b --option -- --this-is-a-file` indicates that
`--this-is-a-file` must not be processed as an option.
+Subcommands are special in a couple of ways:
+
+* Subcommands only have long form, and they have no double dash prefix, no
+ negated form, and no description, and they don't take any arguments, and
+ can't be abbreviated.
+
+* There must be exactly one subcommand among the arguments, or zero if the
+ command has a default operation mode.
+
+* All arguments following the subcommand are considered to be arguments of
+ the subcommand, and, conversely, arguments meant for the subcommand may
+ not precede the subcommand.
+
+Therefore, if the options array contains at least one subcommand and
+`parse_options()` encounters the first dashless argument, it will either:
+
+* stop and return, if that dashless argument is a known subcommand, setting
+ `value` to the function pointer associated with that subcommand, storing
+ the name of the subcommand in argv[0], and leaving the rest of the
+ arguments unprocessed, or
+
+* stop and return, if it was invoked with the `PARSE_OPT_SUBCOMMAND_OPTIONAL`
+ flag and that dashless argument doesn't match any subcommands, leaving
+ `value` unchanged and the rest of the arguments unprocessed, or
+
+* show error and usage, and abort.
+
Steps to parse options
----------------------
@@ -90,8 +118,8 @@ Flags are the bitwise-or of:
Keep the first argument, which contains the program name. It's
removed from argv[] by default.
-`PARSE_OPT_KEEP_UNKNOWN`::
- Keep unknown arguments instead of erroring out. This doesn't
+`PARSE_OPT_KEEP_UNKNOWN_OPT`::
+ Keep unknown options instead of erroring out. This doesn't
work for all combinations of arguments as users might expect
it to do. E.g. if the first argument in `--unknown --known`
takes a value (which we can't know), the second one is
@@ -101,6 +129,8 @@ Flags are the bitwise-or of:
non-option, not as a value belonging to the unknown option,
the parser early. That's why parse_options() errors out if
both options are set.
+ Note that non-option arguments are always kept, even without
+ this flag.
`PARSE_OPT_NO_INTERNAL_HELP`::
By default, parse_options() handles `-h`, `--help` and
@@ -108,6 +138,13 @@ Flags are the bitwise-or of:
turns it off and allows one to add custom handlers for these
options, or to just leave them unknown.
+`PARSE_OPT_SUBCOMMAND_OPTIONAL`::
+ Don't error out when no subcommand is specified.
+
+Note that `PARSE_OPT_STOP_AT_NON_OPTION` is incompatible with subcommands;
+while `PARSE_OPT_KEEP_DASHDASH` and `PARSE_OPT_KEEP_UNKNOWN_OPT` can only be
+used with subcommands when combined with `PARSE_OPT_SUBCOMMAND_OPTIONAL`.
+
Data Structure
--------------
@@ -198,11 +235,6 @@ There are some macros to easily define options:
The filename will be prefixed by passing the filename along with
the prefix argument of `parse_options()` to `prefix_filename()`.
-`OPT_ARGUMENT(long, &int_var, description)`::
- Introduce a long-option argument that will be kept in `argv[]`.
- If this option was seen, `int_var` will be set to one (except
- if a `NULL` pointer was passed).
-
`OPT_NUMBER_CALLBACK(&var, description, func_ptr)`::
Recognize numerical options like -123 and feed the integer as
if it was an argument to the function given by `func_ptr`.
@@ -232,19 +264,23 @@ There are some macros to easily define options:
will be overwritten, so this should only be used for options where
the last one specified on the command line wins.
-`OPT_PASSTHRU_ARGV(short, long, &argv_array_var, arg_str, description, flags)`::
+`OPT_PASSTHRU_ARGV(short, long, &strvec_var, arg_str, description, flags)`::
Introduce an option where all instances of it on the command-line will
- be reconstructed into an argv_array. This is useful when you need to
+ be reconstructed into a strvec. This is useful when you need to
pass the command-line option, which can be specified multiple times,
to another command.
`OPT_CMDMODE(short, long, &int_var, description, enum_val)`::
Define an "operation mode" option, only one of which in the same
group of "operating mode" options that share the same `int_var`
- can be given by the user. `enum_val` is set to `int_var` when the
+ can be given by the user. `int_var` is set to `enum_val` when the
option is used, but an error is reported if other "operating mode"
option has already set its value to the same `int_var`.
+ In new commands consider using subcommands instead.
+`OPT_SUBCOMMAND(long, &fn_ptr, subcommand_fn)`::
+ Define a subcommand. `subcommand_fn` is put into `fn_ptr` when
+ this subcommand is used.
The last element of the array must be `OPT_END()`.
diff --git a/Documentation/technical/api-quote.txt b/Documentation/technical/api-quote.txt
deleted file mode 100644
index e8a1bce..0000000
--- a/Documentation/technical/api-quote.txt
+++ /dev/null
@@ -1,10 +0,0 @@
-quote API
-=========
-
-Talk about <quote.h>, things like
-
-* sq_quote and unquote
-* c_style quote and unquote
-* quoting for foreign languages
-
-(JC)
diff --git a/Documentation/technical/api-ref-iteration.txt b/Documentation/technical/api-ref-iteration.txt
deleted file mode 100644
index ad9d019..0000000
--- a/Documentation/technical/api-ref-iteration.txt
+++ /dev/null
@@ -1,78 +0,0 @@
-ref iteration API
-=================
-
-
-Iteration of refs is done by using an iterate function which will call a
-callback function for every ref. The callback function has this
-signature:
-
- int handle_one_ref(const char *refname, const struct object_id *oid,
- int flags, void *cb_data);
-
-There are different kinds of iterate functions which all take a
-callback of this type. The callback is then called for each found ref
-until the callback returns nonzero. The returned value is then also
-returned by the iterate function.
-
-Iteration functions
--------------------
-
-* `head_ref()` just iterates the head ref.
-
-* `for_each_ref()` iterates all refs.
-
-* `for_each_ref_in()` iterates all refs which have a defined prefix and
- strips that prefix from the passed variable refname.
-
-* `for_each_tag_ref()`, `for_each_branch_ref()`, `for_each_remote_ref()`,
- `for_each_replace_ref()` iterate refs from the respective area.
-
-* `for_each_glob_ref()` iterates all refs that match the specified glob
- pattern.
-
-* `for_each_glob_ref_in()` the previous and `for_each_ref_in()` combined.
-
-* Use `refs_` API for accessing submodules. The submodule ref store could
- be obtained with `get_submodule_ref_store()`.
-
-* `for_each_rawref()` can be used to learn about broken ref and symref.
-
-* `for_each_reflog()` iterates each reflog file.
-
-Submodules
-----------
-
-If you want to iterate the refs of a submodule you first need to add the
-submodules object database. You can do this by a code-snippet like
-this:
-
- const char *path = "path/to/submodule"
- if (add_submodule_odb(path))
- die("Error submodule '%s' not populated.", path);
-
-`add_submodule_odb()` will return zero on success. If you
-do not do this you will get an error for each ref that it does not point
-to a valid object.
-
-Note: As a side-effect of this you cannot safely assume that all
-objects you lookup are available in superproject. All submodule objects
-will be available the same way as the superprojects objects.
-
-Example:
---------
-
-----
-static int handle_remote_ref(const char *refname,
- const unsigned char *sha1, int flags, void *cb_data)
-{
- struct strbuf *output = cb_data;
- strbuf_addf(output, "%s\n", refname);
- return 0;
-}
-
-...
-
- struct strbuf output = STRBUF_INIT;
- for_each_remote_ref(handle_remote_ref, &output);
- printf("%s", output.buf);
-----
diff --git a/Documentation/technical/api-remote.txt b/Documentation/technical/api-remote.txt
deleted file mode 100644
index f10941b..0000000
--- a/Documentation/technical/api-remote.txt
+++ /dev/null
@@ -1,127 +0,0 @@
-Remotes configuration API
-=========================
-
-The API in remote.h gives access to the configuration related to
-remotes. It handles all three configuration mechanisms historically
-and currently used by Git, and presents the information in a uniform
-fashion. Note that the code also handles plain URLs without any
-configuration, giving them just the default information.
-
-struct remote
--------------
-
-`name`::
-
- The user's nickname for the remote
-
-`url`::
-
- An array of all of the url_nr URLs configured for the remote
-
-`pushurl`::
-
- An array of all of the pushurl_nr push URLs configured for the remote
-
-`push`::
-
- An array of refspecs configured for pushing, with
- push_refspec being the literal strings, and push_refspec_nr
- being the quantity.
-
-`fetch`::
-
- An array of refspecs configured for fetching, with
- fetch_refspec being the literal strings, and fetch_refspec_nr
- being the quantity.
-
-`fetch_tags`::
-
- The setting for whether to fetch tags (as a separate rule from
- the configured refspecs); -1 means never to fetch tags, 0
- means to auto-follow tags based on the default heuristic, 1
- means to always auto-follow tags, and 2 means to fetch all
- tags.
-
-`receivepack`, `uploadpack`::
-
- The configured helper programs to run on the remote side, for
- Git-native protocols.
-
-`http_proxy`::
-
- The proxy to use for curl (http, https, ftp, etc.) URLs.
-
-`http_proxy_authmethod`::
-
- The method used for authenticating against `http_proxy`.
-
-struct remotes can be found by name with remote_get(), and iterated
-through with for_each_remote(). remote_get(NULL) will return the
-default remote, given the current branch and configuration.
-
-struct refspec
---------------
-
-A struct refspec holds the parsed interpretation of a refspec. If it
-will force updates (starts with a '+'), force is true. If it is a
-pattern (sides end with '*') pattern is true. src and dest are the
-two sides (including '*' characters if present); if there is only one
-side, it is src, and dst is NULL; if sides exist but are empty (i.e.,
-the refspec either starts or ends with ':'), the corresponding side is
-"".
-
-An array of strings can be parsed into an array of struct refspecs
-using parse_fetch_refspec() or parse_push_refspec().
-
-remote_find_tracking(), given a remote and a struct refspec with
-either src or dst filled out, will fill out the other such that the
-result is in the "fetch" specification for the remote (note that this
-evaluates patterns and returns a single result).
-
-struct branch
--------------
-
-Note that this may end up moving to branch.h
-
-struct branch holds the configuration for a branch. It can be looked
-up with branch_get(name) for "refs/heads/{name}", or with
-branch_get(NULL) for HEAD.
-
-It contains:
-
-`name`::
-
- The short name of the branch.
-
-`refname`::
-
- The full path for the branch ref.
-
-`remote_name`::
-
- The name of the remote listed in the configuration.
-
-`merge_name`::
-
- An array of the "merge" lines in the configuration.
-
-`merge`::
-
- An array of the struct refspecs used for the merge lines. That
- is, merge[i]->dst is a local tracking ref which should be
- merged into this branch by default.
-
-`merge_nr`::
-
- The number of merge configurations
-
-branch_has_merge_config() returns true if the given branch has merge
-configuration given.
-
-Other stuff
------------
-
-There is other stuff in remote.h that is related, in general, to the
-process of interacting with remotes.
-
-(Daniel Barkalow)
diff --git a/Documentation/technical/api-revision-walking.txt b/Documentation/technical/api-revision-walking.txt
deleted file mode 100644
index 03f9ea6..0000000
--- a/Documentation/technical/api-revision-walking.txt
+++ /dev/null
@@ -1,72 +0,0 @@
-revision walking API
-====================
-
-The revision walking API offers functions to build a list of revisions
-and then iterate over that list.
-
-Calling sequence
-----------------
-
-The walking API has a given calling sequence: first you need to
-initialize a rev_info structure, then add revisions to control what kind
-of revision list do you want to get, finally you can iterate over the
-revision list.
-
-Functions
----------
-
-`repo_init_revisions`::
-
- Initialize a rev_info structure with default values. The third
- parameter may be NULL or can be prefix path, and then the `.prefix`
- variable will be set to it. This is typically the first function you
- want to call when you want to deal with a revision list. After calling
- this function, you are free to customize options, like set
- `.ignore_merges` to 0 if you don't want to ignore merges, and so on. See
- `revision.h` for a complete list of available options.
-
-`add_pending_object`::
-
- This function can be used if you want to add commit objects as revision
- information. You can use the `UNINTERESTING` object flag to indicate if
- you want to include or exclude the given commit (and commits reachable
- from the given commit) from the revision list.
-+
-NOTE: If you have the commits as a string list then you probably want to
-use setup_revisions(), instead of parsing each string and using this
-function.
-
-`setup_revisions`::
-
- Parse revision information, filling in the `rev_info` structure, and
- removing the used arguments from the argument list. Returns the number
- of arguments left that weren't recognized, which are also moved to the
- head of the argument list. The last parameter is used in case no
- parameter given by the first two arguments.
-
-`prepare_revision_walk`::
-
- Prepares the rev_info structure for a walk. You should check if it
- returns any error (non-zero return code) and if it does not, you can
- start using get_revision() to do the iteration.
-
-`get_revision`::
-
- Takes a pointer to a `rev_info` structure and iterates over it,
- returning a `struct commit *` each time you call it. The end of the
- revision list is indicated by returning a NULL pointer.
-
-`reset_revision_walk`::
-
- Reset the flags used by the revision walking api. You can use
- this to do multiple sequential revision walks.
-
-Data structures
----------------
-
-Talk about <revision.h>, things like:
-
-* two diff_options, one for path limiting, another for output;
-* remaining functions;
-
-(Linus, JC, Dscho)
diff --git a/Documentation/technical/api-run-command.txt b/Documentation/technical/api-run-command.txt
deleted file mode 100644
index 8bf3e37..0000000
--- a/Documentation/technical/api-run-command.txt
+++ /dev/null
@@ -1,264 +0,0 @@
-run-command API
-===============
-
-The run-command API offers a versatile tool to run sub-processes with
-redirected input and output as well as with a modified environment
-and an alternate current directory.
-
-A similar API offers the capability to run a function asynchronously,
-which is primarily used to capture the output that the function
-produces in the caller in order to process it.
-
-
-Functions
----------
-
-`child_process_init`::
-
- Initialize a struct child_process variable.
-
-`start_command`::
-
- Start a sub-process. Takes a pointer to a `struct child_process`
- that specifies the details and returns pipe FDs (if requested).
- See below for details.
-
-`finish_command`::
-
- Wait for the completion of a sub-process that was started with
- start_command().
-
-`run_command`::
-
- A convenience function that encapsulates a sequence of
- start_command() followed by finish_command(). Takes a pointer
- to a `struct child_process` that specifies the details.
-
-`run_command_v_opt`, `run_command_v_opt_cd_env`::
-
- Convenience functions that encapsulate a sequence of
- start_command() followed by finish_command(). The argument argv
- specifies the program and its arguments. The argument opt is zero
- or more of the flags `RUN_COMMAND_NO_STDIN`, `RUN_GIT_CMD`,
- `RUN_COMMAND_STDOUT_TO_STDERR`, or `RUN_SILENT_EXEC_FAILURE`
- that correspond to the members .no_stdin, .git_cmd,
- .stdout_to_stderr, .silent_exec_failure of `struct child_process`.
- The argument dir corresponds the member .dir. The argument env
- corresponds to the member .env.
-
-`child_process_clear`::
-
- Release the memory associated with the struct child_process.
- Most users of the run-command API don't need to call this
- function explicitly because `start_command` invokes it on
- failure and `finish_command` calls it automatically already.
-
-The functions above do the following:
-
-. If a system call failed, errno is set and -1 is returned. A diagnostic
- is printed.
-
-. If the program was not found, then -1 is returned and errno is set to
- ENOENT; a diagnostic is printed only if .silent_exec_failure is 0.
-
-. Otherwise, the program is run. If it terminates regularly, its exit
- code is returned. No diagnostic is printed, even if the exit code is
- non-zero.
-
-. If the program terminated due to a signal, then the return value is the
- signal number + 128, ie. the same value that a POSIX shell's $? would
- report. A diagnostic is printed.
-
-
-`start_async`::
-
- Run a function asynchronously. Takes a pointer to a `struct
- async` that specifies the details and returns a set of pipe FDs
- for communication with the function. See below for details.
-
-`finish_async`::
-
- Wait for the completion of an asynchronous function that was
- started with start_async().
-
-`run_hook`::
-
- Run a hook.
- The first argument is a pathname to an index file, or NULL
- if the hook uses the default index file or no index is needed.
- The second argument is the name of the hook.
- The further arguments correspond to the hook arguments.
- The last argument has to be NULL to terminate the arguments list.
- If the hook does not exist or is not executable, the return
- value will be zero.
- If it is executable, the hook will be executed and the exit
- status of the hook is returned.
- On execution, .stdout_to_stderr and .no_stdin will be set.
- (See below.)
-
-
-Data structures
----------------
-
-* `struct child_process`
-
-This describes the arguments, redirections, and environment of a
-command to run in a sub-process.
-
-The caller:
-
-1. allocates and clears (using child_process_init() or
- CHILD_PROCESS_INIT) a struct child_process variable;
-2. initializes the members;
-3. calls start_command();
-4. processes the data;
-5. closes file descriptors (if necessary; see below);
-6. calls finish_command().
-
-The .argv member is set up as an array of string pointers (NULL
-terminated), of which .argv[0] is the program name to run (usually
-without a path). If the command to run is a git command, set argv[0] to
-the command name without the 'git-' prefix and set .git_cmd = 1.
-
-Note that the ownership of the memory pointed to by .argv stays with the
-caller, but it should survive until `finish_command` completes. If the
-.argv member is NULL, `start_command` will point it at the .args
-`argv_array` (so you may use one or the other, but you must use exactly
-one). The memory in .args will be cleaned up automatically during
-`finish_command` (or during `start_command` when it is unsuccessful).
-
-The members .in, .out, .err are used to redirect stdin, stdout,
-stderr as follows:
-
-. Specify 0 to request no special redirection. No new file descriptor
- is allocated. The child process simply inherits the channel from the
- parent.
-
-. Specify -1 to have a pipe allocated; start_command() replaces -1
- by the pipe FD in the following way:
-
- .in: Returns the writable pipe end into which the caller writes;
- the readable end of the pipe becomes the child's stdin.
-
- .out, .err: Returns the readable pipe end from which the caller
- reads; the writable end of the pipe end becomes child's
- stdout/stderr.
-
- The caller of start_command() must close the so returned FDs
- after it has completed reading from/writing to it!
-
-. Specify a file descriptor > 0 to be used by the child:
-
- .in: The FD must be readable; it becomes child's stdin.
- .out: The FD must be writable; it becomes child's stdout.
- .err: The FD must be writable; it becomes child's stderr.
-
- The specified FD is closed by start_command(), even if it fails to
- run the sub-process!
-
-. Special forms of redirection are available by setting these members
- to 1:
-
- .no_stdin, .no_stdout, .no_stderr: The respective channel is
- redirected to /dev/null.
-
- .stdout_to_stderr: stdout of the child is redirected to its
- stderr. This happens after stderr is itself redirected.
- So stdout will follow stderr to wherever it is
- redirected.
-
-To modify the environment of the sub-process, specify an array of
-string pointers (NULL terminated) in .env:
-
-. If the string is of the form "VAR=value", i.e. it contains '='
- the variable is added to the child process's environment.
-
-. If the string does not contain '=', it names an environment
- variable that will be removed from the child process's environment.
-
-If the .env member is NULL, `start_command` will point it at the
-.env_array `argv_array` (so you may use one or the other, but not both).
-The memory in .env_array will be cleaned up automatically during
-`finish_command` (or during `start_command` when it is unsuccessful).
-
-To specify a new initial working directory for the sub-process,
-specify it in the .dir member.
-
-If the program cannot be found, the functions return -1 and set
-errno to ENOENT. Normally, an error message is printed, but if
-.silent_exec_failure is set to 1, no message is printed for this
-special error condition.
-
-
-* `struct async`
-
-This describes a function to run asynchronously, whose purpose is
-to produce output that the caller reads.
-
-The caller:
-
-1. allocates and clears (memset(&asy, 0, sizeof(asy));) a
- struct async variable;
-2. initializes .proc and .data;
-3. calls start_async();
-4. processes communicates with proc through .in and .out;
-5. closes .in and .out;
-6. calls finish_async().
-
-The members .in, .out are used to provide a set of fd's for
-communication between the caller and the callee as follows:
-
-. Specify 0 to have no file descriptor passed. The callee will
- receive -1 in the corresponding argument.
-
-. Specify < 0 to have a pipe allocated; start_async() replaces
- with the pipe FD in the following way:
-
- .in: Returns the writable pipe end into which the caller
- writes; the readable end of the pipe becomes the function's
- in argument.
-
- .out: Returns the readable pipe end from which the caller
- reads; the writable end of the pipe becomes the function's
- out argument.
-
- The caller of start_async() must close the returned FDs after it
- has completed reading from/writing from them.
-
-. Specify a file descriptor > 0 to be used by the function:
-
- .in: The FD must be readable; it becomes the function's in.
- .out: The FD must be writable; it becomes the function's out.
-
- The specified FD is closed by start_async(), even if it fails to
- run the function.
-
-The function pointer in .proc has the following signature:
-
- int proc(int in, int out, void *data);
-
-. in, out specifies a set of file descriptors to which the function
- must read/write the data that it needs/produces. The function
- *must* close these descriptors before it returns. A descriptor
- may be -1 if the caller did not configure a descriptor for that
- direction.
-
-. data is the value that the caller has specified in the .data member
- of struct async.
-
-. The return value of the function is 0 on success and non-zero
- on failure. If the function indicates failure, finish_async() will
- report failure as well.
-
-
-There are serious restrictions on what the asynchronous function can do
-because this facility is implemented by a thread in the same address
-space on most platforms (when pthreads is available), but by a pipe to
-a forked process otherwise:
-
-. It cannot change the program's state (global variables, environment,
- etc.) in a way that the caller notices; in other words, .in and .out
- are the only communication channels to the caller.
-
-. It must not change the program's state that the caller of the
- facility also uses.
diff --git a/Documentation/technical/api-setup.txt b/Documentation/technical/api-setup.txt
deleted file mode 100644
index eb1fa98..0000000
--- a/Documentation/technical/api-setup.txt
+++ /dev/null
@@ -1,47 +0,0 @@
-setup API
-=========
-
-Talk about
-
-* setup_git_directory()
-* setup_git_directory_gently()
-* is_inside_git_dir()
-* is_inside_work_tree()
-* setup_work_tree()
-
-(Dscho)
-
-Pathspec
---------
-
-See glossary-context.txt for the syntax of pathspec. In memory, a
-pathspec set is represented by "struct pathspec" and is prepared by
-parse_pathspec(). This function takes several arguments:
-
-- magic_mask specifies what features that are NOT supported by the
- following code. If a user attempts to use such a feature,
- parse_pathspec() can reject it early.
-
-- flags specifies other things that the caller wants parse_pathspec to
- perform.
-
-- prefix and args come from cmd_* functions
-
-parse_pathspec() helps catch unsupported features and reject them
-politely. At a lower level, different pathspec-related functions may
-not support the same set of features. Such pathspec-sensitive
-functions are guarded with GUARD_PATHSPEC(), which will die in an
-unfriendly way when an unsupported feature is requested.
-
-The command designers are supposed to make sure that GUARD_PATHSPEC()
-never dies. They have to make sure all unsupported features are caught
-by parse_pathspec(), not by GUARD_PATHSPEC. grepping GUARD_PATHSPEC()
-should give the designers all pathspec-sensitive codepaths and what
-features they support.
-
-A similar process is applied when a new pathspec magic is added. The
-designer lifts the GUARD_PATHSPEC restriction in the functions that
-support the new magic. At the same time (s)he has to make sure this
-new feature will be caught at parse_pathspec() in commands that cannot
-handle the new magic in some cases. grepping parse_pathspec() should
-help.
diff --git a/Documentation/technical/api-sigchain.txt b/Documentation/technical/api-sigchain.txt
deleted file mode 100644
index 9e1189e..0000000
--- a/Documentation/technical/api-sigchain.txt
+++ /dev/null
@@ -1,41 +0,0 @@
-sigchain API
-============
-
-Code often wants to set a signal handler to clean up temporary files or
-other work-in-progress when we die unexpectedly. For multiple pieces of
-code to do this without conflicting, each piece of code must remember
-the old value of the handler and restore it either when:
-
- 1. The work-in-progress is finished, and the handler is no longer
- necessary. The handler should revert to the original behavior
- (either another handler, SIG_DFL, or SIG_IGN).
-
- 2. The signal is received. We should then do our cleanup, then chain
- to the next handler (or die if it is SIG_DFL).
-
-Sigchain is a tiny library for keeping a stack of handlers. Your handler
-and installation code should look something like:
-
-------------------------------------------
- void clean_foo_on_signal(int sig)
- {
- clean_foo();
- sigchain_pop(sig);
- raise(sig);
- }
-
- void other_func()
- {
- sigchain_push_common(clean_foo_on_signal);
- mess_up_foo();
- clean_foo();
- }
-------------------------------------------
-
-Handlers are given the typedef of sigchain_fun. This is the same type
-that is given to signal() or sigaction(). It is perfectly reasonable to
-push SIG_DFL or SIG_IGN onto the stack.
-
-You can sigchain_push and sigchain_pop individual signals. For
-convenience, sigchain_push_common will push the handler onto the stack
-for many common signals.
diff --git a/Documentation/technical/api-simple-ipc.txt b/Documentation/technical/api-simple-ipc.txt
new file mode 100644
index 0000000..c4fb152
--- /dev/null
+++ b/Documentation/technical/api-simple-ipc.txt
@@ -0,0 +1,105 @@
+Simple-IPC API
+==============
+
+The Simple-IPC API is a collection of `ipc_` prefixed library routines
+and a basic communication protocol that allows an IPC-client process to
+send an application-specific IPC-request message to an IPC-server
+process and receive an application-specific IPC-response message.
+
+Communication occurs over a named pipe on Windows and a Unix domain
+socket on other platforms. IPC-clients and IPC-servers rendezvous at
+a previously agreed-to application-specific pathname (which is outside
+the scope of this design) that is local to the computer system.
+
+The IPC-server routines within the server application process create a
+thread pool to listen for connections and receive request messages
+from multiple concurrent IPC-clients. When received, these messages
+are dispatched up to the server application callbacks for handling.
+IPC-server routines then incrementally relay responses back to the
+IPC-client.
+
+The IPC-client routines within a client application process connect
+to the IPC-server and send a request message and wait for a response.
+When received, the response is returned back to the caller.
+
+For example, the `fsmonitor--daemon` feature will be built as a server
+application on top of the IPC-server library routines. It will have
+threads watching for file system events and a thread pool waiting for
+client connections. Clients, such as `git status`, will request a list
+of file system events since a point in time and the server will
+respond with a list of changed files and directories. The formats of
+the request and response are application-specific; the IPC-client and
+IPC-server routines treat them as opaque byte streams.
+
+
+Comparison with sub-process model
+---------------------------------
+
+The Simple-IPC mechanism differs from the existing `sub-process.c`
+model (Documentation/technical/long-running-process-protocol.txt) and
+used by applications like Git-LFS. In the LFS-style sub-process model,
+the helper is started by the foreground process, communication happens
+via a pair of file descriptors bound to the stdin/stdout of the
+sub-process, the sub-process only serves the current foreground
+process, and the sub-process exits when the foreground process
+terminates.
+
+In the Simple-IPC model the server is a very long-running service. It
+can service many clients at the same time and has a private socket or
+named pipe connection to each active client. It might be started
+(on-demand) by the current client process or it might have been
+started by a previous client or by the OS at boot time. The server
+process is not associated with a terminal and it persists after
+clients terminate. Clients do not have access to the stdin/stdout of
+the server process and therefore must communicate over sockets or
+named pipes.
+
+
+Server startup and shutdown
+---------------------------
+
+How an application server based upon IPC-server is started is also
+outside the scope of the Simple-IPC design and is a property of the
+application using it. For example, the server might be started or
+restarted during routine maintenance operations, or it might be
+started as a system service during the system boot-up sequence, or it
+might be started on-demand by a foreground Git command when needed.
+
+Similarly, server shutdown is a property of the application using
+the simple-ipc routines. For example, the server might decide to
+shutdown when idle or only upon explicit request.
+
+
+Simple-IPC protocol
+-------------------
+
+The Simple-IPC protocol consists of a single request message from the
+client and an optional response message from the server. Both the
+client and server messages are unlimited in length and are terminated
+with a flush packet.
+
+The pkt-line routines (linkgit:gitprotocol-common[5])
+are used to simplify buffer management during message generation,
+transmission, and reception. A flush packet is used to mark the end
+of the message. This allows the sender to incrementally generate and
+transmit the message. It allows the receiver to incrementally receive
+the message in chunks and to know when they have received the entire
+message.
+
+The actual byte format of the client request and server response
+messages are application specific. The IPC layer transmits and
+receives them as opaque byte buffers without any concern for the
+content within. It is the job of the calling application layer to
+understand the contents of the request and response messages.
+
+
+Summary
+-------
+
+Conceptually, the Simple-IPC protocol is similar to an HTTP REST
+request. Clients connect, make an application-specific and
+stateless request, receive an application-specific
+response, and disconnect. It is a one round trip facility for
+querying the server. The Simple-IPC routines hide the socket,
+named pipe, and thread pool details and allow the application
+layer to focus on the task at hand.
diff --git a/Documentation/technical/api-submodule-config.txt b/Documentation/technical/api-submodule-config.txt
deleted file mode 100644
index fb06089..0000000
--- a/Documentation/technical/api-submodule-config.txt
+++ /dev/null
@@ -1,66 +0,0 @@
-submodule config cache API
-==========================
-
-The submodule config cache API allows to read submodule
-configurations/information from specified revisions. Internally
-information is lazily read into a cache that is used to avoid
-unnecessary parsing of the same .gitmodules files. Lookups can be done by
-submodule path or name.
-
-Usage
------
-
-To initialize the cache with configurations from the worktree the caller
-typically first calls `gitmodules_config()` to read values from the
-worktree .gitmodules and then to overlay the local git config values
-`parse_submodule_config_option()` from the config parsing
-infrastructure.
-
-The caller can look up information about submodules by using the
-`submodule_from_path()` or `submodule_from_name()` functions. They return
-a `struct submodule` which contains the values. The API automatically
-initializes and allocates the needed infrastructure on-demand. If the
-caller does only want to lookup values from revisions the initialization
-can be skipped.
-
-If the internal cache might grow too big or when the caller is done with
-the API, all internally cached values can be freed with submodule_free().
-
-Data Structures
----------------
-
-`struct submodule`::
-
- This structure is used to return the information about one
- submodule for a certain revision. It is returned by the lookup
- functions.
-
-Functions
----------
-
-`void submodule_free(struct repository *r)`::
-
- Use these to free the internally cached values.
-
-`int parse_submodule_config_option(const char *var, const char *value)`::
-
- Can be passed to the config parsing infrastructure to parse
- local (worktree) submodule configurations.
-
-`const struct submodule *submodule_from_path(const unsigned char *treeish_name, const char *path)`::
-
- Given a tree-ish in the superproject and a path, return the
- submodule that is bound at the path in the named tree.
-
-`const struct submodule *submodule_from_name(const unsigned char *treeish_name, const char *name)`::
-
- The same as above but lookup by name.
-
-Whenever a submodule configuration is parsed in `parse_submodule_config_option`
-via e.g. `gitmodules_config()`, it will overwrite the null_sha1 entry.
-So in the normal case, when HEAD:.gitmodules is parsed first and then overlayed
-with the repository configuration, the null_sha1 entry contains the local
-configuration of a submodule (e.g. consolidated values from local git
-configuration and the .gitmodules file in the worktree).
-
-For an example usage see test-submodule-config.c.
diff --git a/Documentation/technical/api-trace.txt b/Documentation/technical/api-trace.txt
deleted file mode 100644
index fadb597..0000000
--- a/Documentation/technical/api-trace.txt
+++ /dev/null
@@ -1,140 +0,0 @@
-trace API
-=========
-
-The trace API can be used to print debug messages to stderr or a file. Trace
-code is inactive unless explicitly enabled by setting `GIT_TRACE*` environment
-variables.
-
-The trace implementation automatically adds `timestamp file:line ... \n` to
-all trace messages. E.g.:
-
-------------
-23:59:59.123456 git.c:312 trace: built-in: git 'foo'
-00:00:00.000001 builtin/foo.c:99 foo: some message
-------------
-
-Data Structures
----------------
-
-`struct trace_key`::
-
- Defines a trace key (or category). The default (for API functions that
- don't take a key) is `GIT_TRACE`.
-+
-E.g. to define a trace key controlled by environment variable `GIT_TRACE_FOO`:
-+
-------------
-static struct trace_key trace_foo = TRACE_KEY_INIT(FOO);
-
-static void trace_print_foo(const char *message)
-{
- trace_printf_key(&trace_foo, "%s", message);
-}
-------------
-+
-Note: don't use `const` as the trace implementation stores internal state in
-the `trace_key` structure.
-
-Functions
----------
-
-`int trace_want(struct trace_key *key)`::
-
- Checks whether the trace key is enabled. Used to prevent expensive
- string formatting before calling one of the printing APIs.
-
-`void trace_disable(struct trace_key *key)`::
-
- Disables tracing for the specified key, even if the environment
- variable was set.
-
-`void trace_printf(const char *format, ...)`::
-`void trace_printf_key(struct trace_key *key, const char *format, ...)`::
-
- Prints a formatted message, similar to printf.
-
-`void trace_argv_printf(const char **argv, const char *format, ...)``::
-
- Prints a formatted message, followed by a quoted list of arguments.
-
-`void trace_strbuf(struct trace_key *key, const struct strbuf *data)`::
-
- Prints the strbuf, without additional formatting (i.e. doesn't
- choke on `%` or even `\0`).
-
-`uint64_t getnanotime(void)`::
-
- Returns nanoseconds since the epoch (01/01/1970), typically used
- for performance measurements.
-+
-Currently there are high precision timer implementations for Linux (using
-`clock_gettime(CLOCK_MONOTONIC)`) and Windows (`QueryPerformanceCounter`).
-Other platforms use `gettimeofday` as time source.
-
-`void trace_performance(uint64_t nanos, const char *format, ...)`::
-`void trace_performance_since(uint64_t start, const char *format, ...)`::
-
- Prints the elapsed time (in nanoseconds), or elapsed time since
- `start`, followed by a formatted message. Enabled via environment
- variable `GIT_TRACE_PERFORMANCE`. Used for manual profiling, e.g.:
-+
-------------
-uint64_t start = getnanotime();
-/* code section to measure */
-trace_performance_since(start, "foobar");
-------------
-+
-------------
-uint64_t t = 0;
-for (;;) {
- /* ignore */
- t -= getnanotime();
- /* code section to measure */
- t += getnanotime();
- /* ignore */
-}
-trace_performance(t, "frotz");
-------------
-
-Bugs & Caveats
---------------
-
-GIT_TRACE_* environment variables can be used to tell Git to show
-trace output to its standard error stream. Git can often spawn a pager
-internally to run its subcommand and send its standard output and
-standard error to it.
-
-Because GIT_TRACE_PERFORMANCE trace is generated only at the very end
-of the program with atexit(), which happens after the pager exits, it
-would not work well if you send its log to the standard error output
-and let Git spawn the pager at the same time.
-
-As a work around, you can for example use '--no-pager', or set
-GIT_TRACE_PERFORMANCE to another file descriptor which is redirected
-to stderr, or set GIT_TRACE_PERFORMANCE to a file specified by its
-absolute path.
-
-For example instead of the following command which by default may not
-print any performance information:
-
-------------
-GIT_TRACE_PERFORMANCE=2 git log -1
-------------
-
-you may want to use:
-
-------------
-GIT_TRACE_PERFORMANCE=2 git --no-pager log -1
-------------
-
-or:
-
-------------
-GIT_TRACE_PERFORMANCE=3 3>&2 git log -1
-------------
-
-or:
-
-------------
-GIT_TRACE_PERFORMANCE=/path/to/log/file git log -1
-------------
diff --git a/Documentation/technical/api-trace2.txt b/Documentation/technical/api-trace2.txt
index a045dbe..de5fc25 100644
--- a/Documentation/technical/api-trace2.txt
+++ b/Documentation/technical/api-trace2.txt
@@ -5,7 +5,7 @@ information to stderr or a file. The Trace2 feature is inactive unless
explicitly enabled by enabling one or more Trace2 Targets.
The Trace2 API is intended to replace the existing (Trace1)
-printf-style tracing provided by the existing `GIT_TRACE` and
+`printf()`-style tracing provided by the existing `GIT_TRACE` and
`GIT_TRACE_PERFORMANCE` facilities. During initial implementation,
Trace2 and Trace1 may operate in parallel.
@@ -24,8 +24,8 @@ for example.
Trace2 is controlled using `trace2.*` config values in the system and
global config files and `GIT_TRACE2*` environment variables. Trace2 does
-not read from repo local or worktree config files or respect `-c`
-command line config settings.
+not read from repo local or worktree config files, nor does it respect
+`-c` command line config settings.
== Trace2 Targets
@@ -34,8 +34,8 @@ Format details are given in a later section.
=== The Normal Format Target
-The normal format target is a tradition printf format and similar
-to GIT_TRACE format. This format is enabled with the `GIT_TRACE2`
+The normal format target is a traditional `printf()` format and similar
+to the `GIT_TRACE` format. This format is enabled with the `GIT_TRACE2`
environment variable or the `trace2.normalTarget` system or global
config setting.
@@ -69,8 +69,8 @@ $ cat ~/log.normal
=== The Performance Format Target
The performance format target (PERF) is a column-based format to
-replace GIT_TRACE_PERFORMANCE and is suitable for development and
-testing, possibly to complement tools like gprof. This format is
+replace `GIT_TRACE_PERFORMANCE` and is suitable for development and
+testing, possibly to complement tools like `gprof`. This format is
enabled with the `GIT_TRACE2_PERF` environment variable or the
`trace2.perfTarget` system or global config setting.
@@ -128,7 +128,7 @@ yields
------------
$ cat ~/log.event
-{"event":"version","sid":"sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.620713Z","file":"common-main.c","line":38,"evt":"2","exe":"2.20.1.155.g426c96fcdb"}
+{"event":"version","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.620713Z","file":"common-main.c","line":38,"evt":"3","exe":"2.20.1.155.g426c96fcdb"}
{"event":"start","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621027Z","file":"common-main.c","line":39,"t_abs":0.001173,"argv":["git","version"]}
{"event":"cmd_name","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621122Z","file":"git.c","line":432,"name":"version","hierarchy":"version"}
{"event":"exit","sid":"20190408T191610.507018Z-H9b68c35f-P000059a8","thread":"main","time":"2019-01-16T17:28:42.621236Z","file":"git.c","line":662,"t_abs":0.001227,"code":0}
@@ -148,20 +148,18 @@ filename collisions).
== Trace2 API
-All public Trace2 functions and macros are defined in `trace2.h` and
-`trace2.c`. All public symbols are prefixed with `trace2_`.
+The Trace2 public API is defined and documented in `trace2.h`; refer to it for
+more information. All public functions and macros are prefixed
+with `trace2_` and are implemented in `trace2.c`.
There are no public Trace2 data structures.
The Trace2 code also defines a set of private functions and data types
in the `trace2/` directory. These symbols are prefixed with `tr2_`
-and should only be used by functions in `trace2.c`.
+and should only be used by functions in `trace2.c` (or other private
+source files in `trace2/`).
-== Conventions for Public Functions and Macros
-
-The functions defined by the Trace2 API are declared and documented
-in `trace2.h`. It defines the API functions and wrapper macros for
-Trace2.
+=== Conventions for Public Functions and Macros
Some functions have a `_fl()` suffix to indicate that they take `file`
and `line-number` arguments.
@@ -170,279 +168,9 @@ Some functions have a `_va_fl()` suffix to indicate that they also
take a `va_list` argument.
Some functions have a `_printf_fl()` suffix to indicate that they also
-take a varargs argument.
-
-There are CPP wrapper macros and ifdefs to hide most of these details.
-See `trace2.h` for more details. The following discussion will only
-describe the simplified forms.
-
-== Public API
-
-All Trace2 API functions send a messsage to all of the active
-Trace2 Targets. This section describes the set of available
-messages.
-
-It helps to divide these functions into groups for discussion
-purposes.
-
-=== Basic Command Messages
-
-These are concerned with the lifetime of the overall git process.
-
-`void trace2_initialize_clock()`::
-
- Initialize the Trace2 start clock and nothing else. This should
- be called at the very top of main() to capture the process start
- time and reduce startup order dependencies.
-
-`void trace2_initialize()`::
-
- Determines if any Trace2 Targets should be enabled and
- initializes the Trace2 facility. This includes setting up the
- Trace2 thread local storage (TLS).
-+
-This function emits a "version" message containing the version of git
-and the Trace2 protocol.
-+
-This function should be called from `main()` as early as possible in
-the life of the process after essential process initialization.
-
-`int trace2_is_enabled()`::
-
- Returns 1 if Trace2 is enabled (at least one target is
- active).
-
-`void trace2_cmd_start(int argc, const char **argv)`::
-
- Emits a "start" message containing the process command line
- arguments.
-
-`int trace2_cmd_exit(int exit_code)`::
-
- Emits an "exit" message containing the process exit-code and
- elapsed time.
-+
-Returns the exit-code.
-
-`void trace2_cmd_error(const char *fmt, va_list ap)`::
-
- Emits an "error" message containing a formatted error message.
-
-`void trace2_cmd_path(const char *pathname)`::
-
- Emits a "cmd_path" message with the full pathname of the
- current process.
-
-=== Command Detail Messages
-
-These are concerned with describing the specific Git command
-after the command line, config, and environment are inspected.
-
-`void trace2_cmd_name(const char *name)`::
-
- Emits a "cmd_name" message with the canonical name of the
- command, for example "status" or "checkout".
-
-`void trace2_cmd_mode(const char *mode)`::
-
- Emits a "cmd_mode" message with a qualifier name to further
- describe the current git command.
-+
-This message is intended to be used with git commands having multiple
-major modes. For example, a "checkout" command can checkout a new
-branch or it can checkout a single file, so the checkout code could
-emit a cmd_mode message of "branch" or "file".
-
-`void trace2_cmd_alias(const char *alias, const char **argv_expansion)`::
-
- Emits an "alias" message containing the alias used and the
- argument expansion.
-
-`void trace2_def_param(const char *parameter, const char *value)`::
-
- Emits a "def_param" message containing a key/value pair.
-+
-This message is intended to report some global aspect of the current
-command, such as a configuration setting or command line switch that
-significantly affects program performance or behavior, such as
-`core.abbrev`, `status.showUntrackedFiles`, or `--no-ahead-behind`.
-
-`void trace2_cmd_list_config()`::
-
- Emits a "def_param" messages for "important" configuration
- settings.
-+
-The environment variable `GIT_TRACE2_CONFIG_PARAMS` or the `trace2.configParams`
-config value can be set to a
-list of patterns of important configuration settings, for example:
-`core.*,remote.*.url`. This function will iterate over all config
-settings and emit a "def_param" message for each match.
-
-`void trace2_cmd_set_config(const char *key, const char *value)`::
-
- Emits a "def_param" message for a new or updated key/value
- pair IF `key` is considered important.
-+
-This is used to hook into `git_config_set()` and catch any
-configuration changes and update a value previously reported by
-`trace2_cmd_list_config()`.
-
-`void trace2_def_repo(struct repository *repo)`::
-
- Registers a repository with the Trace2 layer. Assigns a
- unique "repo-id" to `repo->trace2_repo_id`.
-+
-Emits a "worktree" messages containing the repo-id and the worktree
-pathname.
-+
-Region and data messages (described later) may refer to this repo-id.
-+
-The main/top-level repository will have repo-id value 1 (aka "r1").
-+
-The repo-id field is in anticipation of future in-proc submodule
-repositories.
-
-=== Child Process Messages
-
-These are concerned with the various spawned child processes,
-including shell scripts, git commands, editors, pagers, and hooks.
-
-`void trace2_child_start(struct child_process *cmd)`::
-
- Emits a "child_start" message containing the "child-id",
- "child-argv", and "child-classification".
-+
-Before calling this, set `cmd->trace2_child_class` to a name
-describing the type of child process, for example "editor".
-+
-This function assigns a unique "child-id" to `cmd->trace2_child_id`.
-This field is used later during the "child_exit" message to associate
-it with the "child_start" message.
-+
-This function should be called before spawning the child process.
-
-`void trace2_child_exit(struct child_proess *cmd, int child_exit_code)`::
-
- Emits a "child_exit" message containing the "child-id",
- the child's elapsed time and exit-code.
-+
-The reported elapsed time includes the process creation overhead and
-time spend waiting for it to exit, so it may be slightly longer than
-the time reported by the child itself.
-+
-This function should be called after reaping the child process.
-
-`int trace2_exec(const char *exe, const char **argv)`::
-
- Emits a "exec" message containing the "exec-id" and the
- argv of the new process.
-+
-This function should be called before calling one of the `exec()`
-variants, such as `execvp()`.
-+
-This function returns a unique "exec-id". This value is used later
-if the exec() fails and a "exec-result" message is necessary.
-
-`void trace2_exec_result(int exec_id, int error_code)`::
-
- Emits a "exec_result" message containing the "exec-id"
- and the error code.
-+
-On Unix-based systems, `exec()` does not return if successful.
-This message is used to indicate that the `exec()` failed and
-that the current program is continuing.
-
-=== Git Thread Messages
-
-These messages are concerned with Git thread usage.
-
-`void trace2_thread_start(const char *thread_name)`::
-
- Emits a "thread_start" message.
-+
-The `thread_name` field should be a descriptive name, such as the
-unique name of the thread-proc. A unique "thread-id" will be added
-to the name to uniquely identify thread instances.
-+
-Region and data messages (described later) may refer to this thread
-name.
-+
-This function must be called by the thread-proc of the new thread
-(so that TLS data is properly initialized) and not by the caller
-of `pthread_create()`.
-
-`void trace2_thread_exit()`::
-
- Emits a "thread_exit" message containing the thread name
- and the thread elapsed time.
-+
-This function must be called by the thread-proc before it returns
-(so that the coorect TLS data is used and cleaned up. It should
-not be called by the caller of `pthread_join()`.
-
-=== Region and Data Messages
-
-These are concerned with recording performance data
-over regions or spans of code.
-
-`void trace2_region_enter(const char *category, const char *label, const struct repository *repo)`::
-
-`void trace2_region_enter_printf(const char *category, const char *label, const struct repository *repo, const char *fmt, ...)`::
-
-`void trace2_region_enter_printf_va(const char *category, const char *label, const struct repository *repo, const char *fmt, va_list ap)`::
-
- Emits a thread-relative "region_enter" message with optional
- printf string.
-+
-This function pushes a new region nesting stack level on the current
-thread and starts a clock for the new stack frame.
-+
-The `category` field is an arbitrary category name used to classify
-regions by feature area, such as "status" or "index". At this time
-it is only just printed along with the rest of the message. It may
-be used in the future to filter messages.
-+
-The `label` field is an arbitrary label used to describe the activity
-being started, such as "read_recursive" or "do_read_index".
-+
-The `repo` field, if set, will be used to get the "repo-id", so that
-recursive oerations can be attributed to the correct repository.
-
-`void trace2_region_leave(const char *category, const char *label, const struct repository *repo)`::
-
-`void trace2_region_leave_printf(const char *category, const char *label, const struct repository *repo, const char *fmt, ...)`::
-
-`void trace2_region_leave_printf_va(const char *category, const char *label, const struct repository *repo, const char *fmt, va_list ap)`::
-
- Emits a thread-relative "region_leave" message with optional
- printf string.
-+
-This function pops the region nesting stack on the current thread
-and reports the elapsed time of the stack frame.
-+
-The `category`, `label`, and `repo` fields are the same as above.
-The `category` and `label` do not need to match the correpsonding
-"region_enter" message, but it makes the data stream easier to
-understand.
-
-`void trace2_data_string(const char *category, const struct repository *repo, const char *key, const char * value)`::
-
-`void trace2_data_intmax(const char *category, const struct repository *repo, const char *key, intmax value)`::
-
-`void trace2_data_json(const char *category, const struct repository *repo, const char *key, const struct json_writer *jw)`::
-
- Emits a region- and thread-relative "data" or "data_json" message.
-+
-This is a key/value pair message containing information about the
-current thread, region stack, and repository. This could be used
-to print the number of files in a directory during a multi-threaded
-recursive tree walk.
-
-`void trace2_printf(const char *fmt, ...)`::
+take a `printf()` style format with a variable number of arguments.
-`void trace2_printf_va(const char *fmt, va_list ap)`::
-
- Emits a region- and thread-relative "printf" message.
+CPP wrapper macros are defined to hide most of these details.
== Trace2 Target Formats
@@ -459,7 +187,7 @@ Events are written as lines of the form:
is the event name.
`<event-message>`::
- is a free-form printf message intended for human consumption.
+ is a free-form `printf()` message intended for human consumption.
+
Note that this may contain embedded LF or CRLF characters that are
not escaped, so the event may spill across multiple lines.
@@ -525,7 +253,7 @@ This field is in anticipation of in-proc submodules in the future.
indicate a broad category, such as "index" or "status".
`<perf-event-message>`::
- is a free-form printf message intended for human consumption.
+ is a free-form `printf()` message intended for human consumption.
------------
15:33:33.532712 wt-status.c:2310 | d0 | main | region_enter | r1 | 0.126064 | | status | label:print
@@ -616,19 +344,19 @@ only present on the "start" and "atexit" events.
{
"event":"version",
...
- "evt":"2", # EVENT format version
+ "evt":"3", # EVENT format version
"exe":"2.20.1.155.g426c96fcdb" # git version
}
------------
-`"discard"`::
+`"too_many_files"`::
This event is written to the git-trace2-discard sentinel file if there
are too many files in the target trace directory (see the
trace2.maxFiles config option).
+
------------
{
- "event":"discard",
+ "event":"too_many_files",
...
}
------------
@@ -690,8 +418,8 @@ completed.)
------------
`"error"`::
- This event is emitted when one of the `error()`, `die()`,
- or `usage()` functions are called.
+ This event is emitted when one of the `BUG()`, `bug()`, `error()`,
+ `die()`, `warning()`, or `usage()` functions are called.
+
------------
{
@@ -718,6 +446,20 @@ about specific error arguments.
}
------------
+`"cmd_ancestry"`::
+ This event contains the text command name for the parent (and earlier
+ generations of parents) of the current process, in an array ordered from
+ nearest parent to furthest great-grandparent. It may not be implemented
+ on all platforms.
++
+------------
+{
+ "event":"cmd_ancestry",
+ ...
+ "ancestry":["bash","tmux: server","systemd"]
+}
+------------
+
`"cmd_name"`::
This event contains the command name for this git process
and the hierarchy of commands from parent git processes.
@@ -744,7 +486,7 @@ these special values are used:
------------
`"cmd_mode"`::
- This event, when present, describes the command variant This
+ This event, when present, describes the command variant. This
event may be emitted more than once.
+
------------
@@ -799,7 +541,7 @@ with "?".
`"child_exit"`::
This event is generated after the current process has returned
- from the waitpid() and collected the exit information from the
+ from the `waitpid()` and collected the exit information from the
child.
+
------------
@@ -816,14 +558,54 @@ with "?".
Note that the session-id of the child process is not available to
the current/spawning process, so the child's PID is reported here as
a hint for post-processing. (But it is only a hint because the child
-proces may be a shell script which doesn't have a session-id.)
+process may be a shell script which doesn't have a session-id.)
+
Note that the `t_rel` field contains the observed run time in seconds
for the child process (starting before the fork/exec/spawn and
-stopping after the waitpid() and includes OS process creation overhead).
+stopping after the `waitpid()` and includes OS process creation overhead).
So this time will be slightly larger than the atexit time reported by
the child process itself.
+`"child_ready"`::
+ This event is generated after the current process has started
+ a background process and released all handles to it.
++
+------------
+{
+ "event":"child_ready",
+ ...
+ "child_id":2,
+ "pid":14708, # child PID
+ "ready":"ready", # child ready state
+ "t_rel":0.110605 # observed run-time of child process
+}
+------------
++
+Note that the session-id of the child process is not available to
+the current/spawning process, so the child's PID is reported here as
+a hint for post-processing. (But it is only a hint because the child
+process may be a shell script which doesn't have a session-id.)
++
+This event is generated after the child is started in the background
+and given a little time to boot up and start working. If the child
+starts up normally while the parent is still waiting, the "ready"
+field will have the value "ready".
+If the child is too slow to start and the parent times out, the field
+will have the value "timeout".
+If the child starts but the parent is unable to probe it, the field
+will have the value "error".
++
+After the parent process emits this event, it will release all of its
+handles to the child process and treat the child as a background
+daemon. So even if the child does eventually finish booting up,
+the parent will not emit an updated event.
++
+Note that the `t_rel` field contains the observed run time in seconds
+when the parent released the child process into the background.
+The child is assumed to be a long-running daemon process and may
+outlive the parent process. So the parent's child event times should
+not be compared to the child's atexit times.
+
`"exec"`::
This event is generated before git attempts to `exec()`
another command rather than starting a child process.
@@ -856,8 +638,8 @@ The "exec_id" field is a command-unique id and is only useful if the
`"thread_start"`::
This event is generated when a thread is started. It is
- generated from *within* the new thread's thread-proc (for TLS
- reasons).
+ generated from *within* the new thread's thread-proc (because
+ it needs to access data in the thread's thread-local storage).
+
------------
{
@@ -869,7 +651,7 @@ The "exec_id" field is a command-unique id and is only useful if the
`"thread_exit"`::
This event is generated when a thread exits. It is generated
- from *within* the thread's thread-proc (for TLS reasons).
+ from *within* the thread's thread-proc.
+
------------
{
@@ -881,12 +663,14 @@ The "exec_id" field is a command-unique id and is only useful if the
------------
`"def_param"`::
- This event is generated to log a global parameter.
+ This event is generated to log a global parameter, such as a config
+ setting, command-line flag, or environment variable.
+
------------
{
"event":"def_param",
...
+ "scope":"global",
"param":"core.abbrev",
"value":"7"
}
@@ -985,6 +769,73 @@ The "value" field may be an integer or a string.
}
------------
+`"th_timer"`::
+ This event logs the amount of time that a stopwatch timer was
+ running in the thread. This event is generated when a thread
+ exits for timers that requested per-thread events.
++
+------------
+{
+ "event":"th_timer",
+ ...
+ "category":"my_category",
+ "name":"my_timer",
+ "intervals":5, # number of time it was started/stopped
+ "t_total":0.052741, # total time in seconds it was running
+ "t_min":0.010061, # shortest interval
+ "t_max":0.011648 # longest interval
+}
+------------
+
+`"timer"`::
+ This event logs the amount of time that a stopwatch timer was
+ running aggregated across all threads. This event is generated
+ when the process exits.
++
+------------
+{
+ "event":"timer",
+ ...
+ "category":"my_category",
+ "name":"my_timer",
+ "intervals":5, # number of time it was started/stopped
+ "t_total":0.052741, # total time in seconds it was running
+ "t_min":0.010061, # shortest interval
+ "t_max":0.011648 # longest interval
+}
+------------
+
+`"th_counter"`::
+ This event logs the value of a counter variable in a thread.
+ This event is generated when a thread exits for counters that
+ requested per-thread events.
++
+------------
+{
+ "event":"th_counter",
+ ...
+ "category":"my_category",
+ "name":"my_counter",
+ "count":23
+}
+------------
+
+`"counter"`::
+ This event logs the value of a counter variable across all threads.
+ This event is generated when the process exits. The total value
+ reported here is the sum across all threads.
++
+------------
+{
+ "event":"counter",
+ ...
+ "category":"my_category",
+ "name":"my_counter",
+ "count":23
+}
+------------
+
+
== Example Trace2 API Usage
Here is a hypothetical usage of the Trace2 API showing the intended
@@ -1119,7 +970,7 @@ atexit elapsed:3.868970 code:0
Regions::
- Regions can be use to time an interesting section of code.
+ Regions can be used to time an interesting section of code.
+
----------------
void wt_status_collect(struct wt_status *s)
@@ -1176,7 +1027,7 @@ d0 | main | atexit | | 0.028809 | |
+
Regions may be nested. This causes messages to be indented in the
PERF target, for example.
-Elapsed times are relative to the start of the correpsonding nesting
+Elapsed times are relative to the start of the corresponding nesting
level as expected. For example, if we add region message to:
+
----------------
@@ -1273,9 +1124,9 @@ Thread Events::
Thread messages added to a thread-proc.
+
-For example, the multithreaded preload-index code can be
+For example, the multi-threaded preload-index code can be
instrumented with a region around the thread pool and then
-per-thread start and exit events within the threadproc.
+per-thread start and exit events within the thread-proc.
+
----------------
static void *preload_thread(void *_data)
@@ -1371,11 +1222,104 @@ d0 | main | atexit | | 0.030027 | |
In this example, the preload region took 0.009122 seconds. The 7 threads
took between 0.006069 and 0.008947 seconds to work on their portion of
the index. Thread "th01" worked on 508 items at offset 0. Thread "th02"
-worked on 508 items at offset 2032. Thread "th04" worked on 508 itemts
+worked on 508 items at offset 2032. Thread "th04" worked on 508 items
at offset 508.
+
This example also shows that thread names are assigned in a racy manner
-as each thread starts and allocates TLS storage.
+as each thread starts.
+
+Config (def param) Events::
+
+ Dump "interesting" config values to trace2 log.
++
+We can optionally emit configuration events, see
+`trace2.configparams` in linkgit:git-config[1] for how to enable
+it.
++
+----------------
+$ git config --system color.ui never
+$ git config --global color.ui always
+$ git config --local color.ui auto
+$ git config --list --show-scope | grep 'color.ui'
+system color.ui=never
+global color.ui=always
+local color.ui=auto
+----------------
++
+Then, mark the config `color.ui` as "interesting" config with
+`GIT_TRACE2_CONFIG_PARAMS`:
++
+----------------
+$ export GIT_TRACE2_PERF_BRIEF=1
+$ export GIT_TRACE2_PERF=~/log.perf
+$ export GIT_TRACE2_CONFIG_PARAMS=color.ui
+$ git version
+...
+$ cat ~/log.perf
+d0 | main | version | | | | | ...
+d0 | main | start | | 0.001642 | | | /usr/local/bin/git version
+d0 | main | cmd_name | | | | | version (version)
+d0 | main | def_param | | | | scope:system | color.ui:never
+d0 | main | def_param | | | | scope:global | color.ui:always
+d0 | main | def_param | | | | scope:local | color.ui:auto
+d0 | main | data | r0 | 0.002100 | 0.002100 | fsync | fsync/writeout-only:0
+d0 | main | data | r0 | 0.002126 | 0.002126 | fsync | fsync/hardware-flush:0
+d0 | main | exit | | 0.000470 | | | code:0
+d0 | main | atexit | | 0.000477 | | | code:0
+----------------
+
+Stopwatch Timer Events::
+
+ Measure the time spent in a function call or span of code
+ that might be called from many places within the code
+ throughout the life of the process.
++
+----------------
+static void expensive_function(void)
+{
+ trace2_timer_start(TRACE2_TIMER_ID_TEST1);
+ ...
+ sleep_millisec(1000); // Do something expensive
+ ...
+ trace2_timer_stop(TRACE2_TIMER_ID_TEST1);
+}
+
+static int ut_100timer(int argc, const char **argv)
+{
+ ...
+
+ expensive_function();
+
+ // Do something else 1...
+
+ expensive_function();
+
+ // Do something else 2...
+
+ expensive_function();
+
+ return 0;
+}
+----------------
++
+In this example, we measure the total time spent in
+`expensive_function()` regardless of when it is called
+in the overall flow of the program.
++
+----------------
+$ export GIT_TRACE2_PERF_BRIEF=1
+$ export GIT_TRACE2_PERF=~/log.perf
+$ t/helper/test-tool trace2 100timer 3 1000
+...
+$ cat ~/log.perf
+d0 | main | version | | | | | ...
+d0 | main | start | | 0.001453 | | | t/helper/test-tool trace2 100timer 3 1000
+d0 | main | cmd_name | | | | | trace2 (trace2)
+d0 | main | exit | | 3.003667 | | | code:0
+d0 | main | timer | | | | test | name:test1 intervals:3 total:3.001686 min:1.000254 max:1.000929
+d0 | main | atexit | | 3.003796 | | | code:0
+----------------
+
== Future Work
@@ -1384,11 +1328,11 @@ as each thread starts and allocates TLS storage.
There are a few issues to resolve before we can completely
switch to Trace2.
-* Updating existing tests that assume GIT_TRACE format messages.
+* Updating existing tests that assume `GIT_TRACE` format messages.
-* How to best handle custom GIT_TRACE_<key> messages?
+* How to best handle custom `GIT_TRACE_<key>` messages?
-** The GIT_TRACE_<key> mechanism allows each <key> to write to a
+** The `GIT_TRACE_<key>` mechanism allows each <key> to write to a
different file (in addition to just stderr).
** Do we want to maintain that ability or simply write to the existing
diff --git a/Documentation/technical/api-tree-walking.txt b/Documentation/technical/api-tree-walking.txt
deleted file mode 100644
index 7962e32..0000000
--- a/Documentation/technical/api-tree-walking.txt
+++ /dev/null
@@ -1,149 +0,0 @@
-tree walking API
-================
-
-The tree walking API is used to traverse and inspect trees.
-
-Data Structures
----------------
-
-`struct name_entry`::
-
- An entry in a tree. Each entry has a sha1 identifier, pathname, and
- mode.
-
-`struct tree_desc`::
-
- A semi-opaque data structure used to maintain the current state of the
- walk.
-+
-* `buffer` is a pointer into the memory representation of the tree. It always
-points at the current entry being visited.
-
-* `size` counts the number of bytes left in the `buffer`.
-
-* `entry` points to the current entry being visited.
-
-`struct traverse_info`::
-
- A structure used to maintain the state of a traversal.
-+
-* `prev` points to the traverse_info which was used to descend into the
-current tree. If this is the top-level tree `prev` will point to
-a dummy traverse_info.
-
-* `name` is the entry for the current tree (if the tree is a subtree).
-
-* `pathlen` is the length of the full path for the current tree.
-
-* `conflicts` can be used by callbacks to maintain directory-file conflicts.
-
-* `fn` is a callback called for each entry in the tree. See Traversing for more
-information.
-
-* `data` can be anything the `fn` callback would want to use.
-
-* `show_all_errors` tells whether to stop at the first error or not.
-
-Initializing
-------------
-
-`init_tree_desc`::
-
- Initialize a `tree_desc` and decode its first entry. The buffer and
- size parameters are assumed to be the same as the buffer and size
- members of `struct tree`.
-
-`fill_tree_descriptor`::
-
- Initialize a `tree_desc` and decode its first entry given the
- object ID of a tree. Returns the `buffer` member if the latter
- is a valid tree identifier and NULL otherwise.
-
-`setup_traverse_info`::
-
- Initialize a `traverse_info` given the pathname of the tree to start
- traversing from.
-
-Walking
--------
-
-`tree_entry`::
-
- Visit the next entry in a tree. Returns 1 when there are more entries
- left to visit and 0 when all entries have been visited. This is
- commonly used in the test of a while loop.
-
-`tree_entry_len`::
-
- Calculate the length of a tree entry's pathname. This utilizes the
- memory structure of a tree entry to avoid the overhead of using a
- generic strlen().
-
-`update_tree_entry`::
-
- Walk to the next entry in a tree. This is commonly used in conjunction
- with `tree_entry_extract` to inspect the current entry.
-
-`tree_entry_extract`::
-
- Decode the entry currently being visited (the one pointed to by
- `tree_desc's` `entry` member) and return the sha1 of the entry. The
- `pathp` and `modep` arguments are set to the entry's pathname and mode
- respectively.
-
-`get_tree_entry`::
-
- Find an entry in a tree given a pathname and the sha1 of a tree to
- search. Returns 0 if the entry is found and -1 otherwise. The third
- and fourth parameters are set to the entry's sha1 and mode
- respectively.
-
-Traversing
-----------
-
-`traverse_trees`::
-
- Traverse `n` number of trees in parallel. The `fn` callback member of
- `traverse_info` is called once for each tree entry.
-
-`traverse_callback_t`::
- The arguments passed to the traverse callback are as follows:
-+
-* `n` counts the number of trees being traversed.
-
-* `mask` has its nth bit set if something exists in the nth entry.
-
-* `dirmask` has its nth bit set if the nth tree's entry is a directory.
-
-* `entry` is an array of size `n` where the nth entry is from the nth tree.
-
-* `info` maintains the state of the traversal.
-
-+
-Returning a negative value will terminate the traversal. Otherwise the
-return value is treated as an update mask. If the nth bit is set the nth tree
-will be updated and if the bit is not set the nth tree entry will be the
-same in the next callback invocation.
-
-`make_traverse_path`::
-
- Generate the full pathname of a tree entry based from the root of the
- traversal. For example, if the traversal has recursed into another
- tree named "bar" the pathname of an entry "baz" in the "bar"
- tree would be "bar/baz".
-
-`traverse_path_len`::
-
- Calculate the length of a pathname returned by `make_traverse_path`.
- This utilizes the memory structure of a tree entry to avoid the
- overhead of using a generic strlen().
-
-`strbuf_make_traverse_path`::
-
- Convenience wrapper to `make_traverse_path` into a strbuf.
-
-Authors
--------
-
-Written by Junio C Hamano <gitster@pobox.com> and Linus Torvalds
-<torvalds@linux-foundation.org>
diff --git a/Documentation/technical/api-xdiff-interface.txt b/Documentation/technical/api-xdiff-interface.txt
deleted file mode 100644
index 6296eca..0000000
--- a/Documentation/technical/api-xdiff-interface.txt
+++ /dev/null
@@ -1,7 +0,0 @@
-xdiff interface API
-===================
-
-Talk about our calling convention to xdiff library, including
-xdiff_emit_consume_fn.
-
-(Dscho, JC)
diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
index f8c18a0..f5d2009 100644
--- a/Documentation/technical/bitmap-format.txt
+++ b/Documentation/technical/bitmap-format.txt
@@ -1,93 +1,157 @@
GIT bitmap v1 format
====================
- - A header appears at the beginning:
-
- 4-byte signature: {'B', 'I', 'T', 'M'}
-
- 2-byte version number (network byte order)
- The current implementation only supports version 1
- of the bitmap index (the same one as JGit).
-
- 2-byte flags (network byte order)
-
- The following flags are supported:
-
- - BITMAP_OPT_FULL_DAG (0x1) REQUIRED
- This flag must always be present. It implies that the bitmap
- index has been generated for a packfile with full closure
- (i.e. where every single object in the packfile can find
- its parent links inside the same packfile). This is a
- requirement for the bitmap index format, also present in JGit,
- that greatly reduces the complexity of the implementation.
-
- - BITMAP_OPT_HASH_CACHE (0x4)
- If present, the end of the bitmap file contains
- `N` 32-bit name-hash values, one per object in the
- pack. The format and meaning of the name-hash is
- described below.
-
- 4-byte entry count (network byte order)
-
- The total count of entries (bitmapped commits) in this bitmap index.
-
- 20-byte checksum
-
- The SHA1 checksum of the pack this bitmap index belongs to.
-
- - 4 EWAH bitmaps that act as type indexes
-
- Type indexes are serialized after the hash cache in the shape
- of four EWAH bitmaps stored consecutively (see Appendix A for
- the serialization format of an EWAH bitmap).
-
- There is a bitmap for each Git object type, stored in the following
- order:
-
- - Commits
- - Trees
- - Blobs
- - Tags
-
- In each bitmap, the `n`th bit is set to true if the `n`th object
- in the packfile is of that type.
-
- The obvious consequence is that the OR of all 4 bitmaps will result
- in a full set (all bits set), and the AND of all 4 bitmaps will
- result in an empty bitmap (no bits set).
-
- - N entries with compressed bitmaps, one for each indexed commit
-
- Where `N` is the total amount of entries in this bitmap index.
- Each entry contains the following:
-
- - 4-byte object position (network byte order)
- The position **in the index for the packfile** where the
- bitmap for this commit is found.
-
- - 1-byte XOR-offset
- The xor offset used to compress this bitmap. For an entry
- in position `x`, a XOR offset of `y` means that the actual
- bitmap representing this commit is composed by XORing the
- bitmap for this entry with the bitmap in entry `x-y` (i.e.
- the bitmap `y` entries before this one).
-
- Note that this compression can be recursive. In order to
- XOR this entry with a previous one, the previous entry needs
- to be decompressed first, and so on.
-
- The hard-limit for this offset is 160 (an entry can only be
- xor'ed against one of the 160 entries preceding it). This
- number is always positive, and hence entries are always xor'ed
- with **previous** bitmaps, not bitmaps that will come afterwards
- in the index.
-
- - 1-byte flags for this bitmap
- At the moment the only available flag is `0x1`, which hints
- that this bitmap can be re-used when rebuilding bitmap indexes
- for the repository.
-
- - The compressed bitmap itself, see Appendix A.
+== Pack and multi-pack bitmaps
+
+Bitmaps store reachability information about the set of objects in a packfile,
+or a multi-pack index (MIDX). The former is defined obviously, and the latter is
+defined as the union of objects in packs contained in the MIDX.
+
+A bitmap may belong to either one pack, or the repository's multi-pack index (if
+it exists). A repository may have at most one bitmap.
+
+An object is uniquely described by its bit position within a bitmap:
+
+ - If the bitmap belongs to a packfile, the __n__th bit corresponds to
+ the __n__th object in pack order. For a function `offset` which maps
+ objects to their byte offset within a pack, pack order is defined as
+ follows:
+
+ o1 <= o2 <==> offset(o1) <= offset(o2)
+
+ - If the bitmap belongs to a MIDX, the __n__th bit corresponds to the
+ __n__th object in MIDX order. With an additional function `pack` which
+ maps objects to the pack they were selected from by the MIDX, MIDX order
+ is defined as follows:
+
+ o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
++
+The ordering between packs is done according to the MIDX's .rev file.
+Notably, the preferred pack sorts ahead of all other packs.
+
+The on-disk representation (described below) of a bitmap is the same regardless
+of whether or not that bitmap belongs to a packfile or a MIDX. The only
+difference is the interpretation of the bits, which is described above.
+
+Certain bitmap extensions are supported (see: Appendix B). No extensions are
+required for bitmaps corresponding to packfiles. For bitmaps that correspond to
+MIDXs, both the bit-cache and rev-cache extensions are required.
+
+== On-disk format
+
+ * A header appears at the beginning:
+
+ 4-byte signature: :: {'B', 'I', 'T', 'M'}
+
+ 2-byte version number (network byte order): ::
+
+ The current implementation only supports version 1
+ of the bitmap index (the same one as JGit).
+
+ 2-byte flags (network byte order): ::
+
+ The following flags are supported:
+
+ ** {empty}
+ BITMAP_OPT_FULL_DAG (0x1) REQUIRED: :::
+
+ This flag must always be present. It implies that the
+ bitmap index has been generated for a packfile or
+ multi-pack index (MIDX) with full closure (i.e. where
+ every single object in the packfile/MIDX can find its
+ parent links inside the same packfile/MIDX). This is a
+ requirement for the bitmap index format, also present in
+ JGit, that greatly reduces the complexity of the
+ implementation.
+
+ ** {empty}
+ BITMAP_OPT_HASH_CACHE (0x4): :::
+
+ If present, the end of the bitmap file contains
+ `N` 32-bit name-hash values, one per object in the
+ pack/MIDX. The format and meaning of the name-hash is
+ described below.
+
+ ** {empty}
+ BITMAP_OPT_LOOKUP_TABLE (0x10): :::
+ If present, the end of the bitmap file contains a table
+ containing a list of `N` <commit_pos, offset, xor_row>
+ triplets. The format and meaning of the table is described
+ below.
++
+NOTE: Unlike the xor_offset used to compress an individual bitmap,
+`xor_row` stores an *absolute* index into the lookup table, not a location
+relative to the current entry.
+
+ 4-byte entry count (network byte order): ::
+ The total count of entries (bitmapped commits) in this bitmap index.
+
+ 20-byte checksum: ::
+ The SHA1 checksum of the pack/MIDX this bitmap index
+ belongs to.
+
+ * 4 EWAH bitmaps that act as type indexes
++
+Type indexes are serialized after the hash cache in the shape
+of four EWAH bitmaps stored consecutively (see Appendix A for
+the serialization format of an EWAH bitmap).
++
+There is a bitmap for each Git object type, stored in the following
+order:
++
+ - Commits
+ - Trees
+ - Blobs
+ - Tags
+
++
+In each bitmap, the `n`th bit is set to true if the `n`th object
+in the packfile or multi-pack index is of that type.
++
+The obvious consequence is that the OR of all 4 bitmaps will result
+in a full set (all bits set), and the AND of all 4 bitmaps will
+result in an empty bitmap (no bits set).
+
+ * N entries with compressed bitmaps, one for each indexed commit
++
+Where `N` is the total number of entries in this bitmap index.
+Each entry contains the following:
+
+ ** {empty}
+ 4-byte object position (network byte order): ::
+ The position **in the index for the packfile or
+ multi-pack index** where the bitmap for this commit is
+ found.
+
+ ** {empty}
+ 1-byte XOR-offset: ::
+ The xor offset used to compress this bitmap. For an entry
+ in position `x`, an XOR offset of `y` means that the actual
+ bitmap representing this commit is composed by XORing the
+ bitmap for this entry with the bitmap in entry `x-y` (i.e.
+ the bitmap `y` entries before this one).
++
+NOTE: This compression can be recursive. In order to
+XOR this entry with a previous one, the previous entry needs
+to be decompressed first, and so on.
++
+The hard-limit for this offset is 160 (an entry can only be
+xor'ed against one of the 160 entries preceding it). This
+number is always positive, and hence entries are always xor'ed
+with **previous** bitmaps, not bitmaps that will come afterwards
+in the index.
+
+ ** {empty}
+ 1-byte flags for this bitmap: ::
+ At the moment the only available flag is `0x1`, which hints
+ that this bitmap can be re-used when rebuilding bitmap indexes
+ for the repository.
+
+ ** The compressed bitmap itself, see Appendix A.
+
+ * {empty}
+ TRAILER: ::
+ Trailing checksum of the preceding contents.
== Appendix A: Serialization format for an EWAH bitmap
@@ -100,8 +164,8 @@ implementation:
- 4-byte number of words of the COMPRESSED bitmap, when stored
- N x 8-byte words, as specified by the previous field
-
- This is the actual content of the compressed bitmap.
++
+This is the actual content of the compressed bitmap.
- 4-byte position of the current RLW for the compressed
bitmap
@@ -146,10 +210,11 @@ Name-hash cache
---------------
If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains
-a cache of 32-bit values, one per object in the pack. The value at
+a cache of 32-bit values, one per object in the pack/MIDX. The value at
position `i` is the hash of the pathname at which the `i`th object
-(counting in index order) in the pack can be found. This can be fed
-into the delta heuristics to compare objects with similar pathnames.
+(counting in index or multi-pack index order) in the pack/MIDX can be found.
+This can be fed into the delta heuristics to compare objects with similar
+pathnames.
The hash algorithm used is:
@@ -162,3 +227,31 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
If implementations want to choose a different hashing scheme, they are
free to do so, but MUST allocate a new header flag (because comparing
hashes made under two different schemes would be pointless).
+
+Commit lookup table
+-------------------
+
+If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)`
+bytes (preceding the name-hash cache and trailing hash) of the `.bitmap`
+file contains a lookup table specifying the information needed to get
+the desired bitmap from the entries without parsing previous unnecessary
+bitmaps.
+
+For a `.bitmap` containing `nr_entries` reachability bitmaps, the table
+contains a list of `nr_entries` <commit_pos, offset, xor_row> triplets
+(sorted in the ascending order of `commit_pos`). The content of the i'th
+triplet is -
+
+ * {empty}
+ commit_pos (4 byte integer, network byte order): ::
+ It stores the object position of a commit (in the midx or pack
+ index).
+
+ * {empty}
+ offset (8 byte integer, network byte order): ::
+ The offset from which that commit's bitmap can be read.
+
+ * {empty}
+ xor_row (4 byte integer, network byte order): ::
+ The position of the triplet whose bitmap is used to compress
+ this one, or `0xffffffff` if no such bitmap exists.
diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
new file mode 100644
index 0000000..91d3a13
--- /dev/null
+++ b/Documentation/technical/bundle-uri.txt
@@ -0,0 +1,572 @@
+Bundle URIs
+===========
+
+Git bundles are files that store a pack-file along with some extra metadata,
+including a set of refs and a (possibly empty) set of necessary commits. See
+linkgit:git-bundle[1] and linkgit:gitformat-bundle[5] for more information.
+
+Bundle URIs are locations where Git can download one or more bundles in
+order to bootstrap the object database in advance of fetching the remaining
+objects from a remote.
+
+One goal is to speed up clones and fetches for users with poor network
+connectivity to the origin server. Another benefit is to allow heavy users,
+such as CI build farms, to use local resources for the majority of Git data
+and thereby reducing the load on the origin server.
+
+To enable the bundle URI feature, users can specify a bundle URI using
+command-line options or the origin server can advertise one or more URIs
+via a protocol v2 capability.
+
+Design Goals
+------------
+
+The bundle URI standard aims to be flexible enough to satisfy multiple
+workloads. The bundle provider and the Git client have several choices in
+how they create and consume bundle URIs.
+
+* Bundles can have whatever name the server desires. This name could refer
+ to immutable data by using a hash of the bundle contents. However, this
+ means that a new URI will be needed after every update of the content.
+ This might be acceptable if the server is advertising the URI (and the
+ server is aware of new bundles being generated) but would not be
+ ergonomic for users using the command line option.
+
+* The bundles could be organized specifically for bootstrapping full
+ clones, but could also be organized with the intention of bootstrapping
+ incremental fetches. The bundle provider must decide on one of several
+ organization schemes to minimize client downloads during incremental
+ fetches, but the Git client can also choose whether to use bundles for
+ either of these operations.
+
+* The bundle provider can choose to support full clones, partial clones,
+ or both. The client can detect which bundles are appropriate for the
+ repository's partial clone filter, if any.
+
+* The bundle provider can use a single bundle (for clones only), or a
+ list of bundles. When using a list of bundles, the provider can specify
+ whether or not the client needs _all_ of the bundle URIs for a full
+ clone, or if _any_ one of the bundle URIs is sufficient. This allows the
+ bundle provider to use different URIs for different geographies.
+
+* The bundle provider can organize the bundles using heuristics, such as
+ creation tokens, to help the client prevent downloading bundles it does
+ not need. When the bundle provider does not provide these heuristics,
+ the client can use optimizations to minimize how much of the data is
+ downloaded.
+
+* The bundle provider does not need to be associated with the Git server.
+ The client can choose to use the bundle provider without it being
+ advertised by the Git server.
+
+* The client can choose to discover bundle providers that are advertised
+ by the Git server. This could happen during `git clone`, during
+ `git fetch`, both, or neither. The user can choose which combination
+ works best for them.
+
+* The client can choose to configure a bundle provider manually at any
+ time. The client can also choose to specify a bundle provider manually
+ as a command-line option to `git clone`.
+
+Each repository is different and every Git server has different needs.
+Hopefully the bundle URI feature is flexible enough to satisfy all needs.
+If not, then the feature can be extended through its versioning mechanism.
+
+Server requirements
+-------------------
+
+To provide a server-side implementation of bundle servers, no other parts
+of the Git protocol are required. This allows server maintainers to use
+static content solutions such as CDNs in order to serve the bundle files.
+
+At the current scope of the bundle URI feature, all URIs are expected to
+be HTTP(S) URLs where content is downloaded to a local file using a `GET`
+request to that URL. The server could include authentication requirements
+to those requests with the aim of triggering the configured credential
+helper for secure access. (Future extensions could use "file://" URIs or
+SSH URIs.)
+
+Assuming a `200 OK` response from the server, the content at the URL is
+inspected. First, Git attempts to parse the file as a bundle file of
+version 2 or higher. If the file is not a bundle, then the file is parsed
+as a plain-text file using Git's config parser. The key-value pairs in
+that config file are expected to describe a list of bundle URIs. If
+neither of these parse attempts succeed, then Git will report an error to
+the user that the bundle URI provided erroneous data.
+
+Any other data provided by the server is considered erroneous.
+
+Bundle Lists
+------------
+
+The Git server can advertise bundle URIs using a set of `key=value` pairs.
+A bundle URI can also serve a plain-text file in the Git config format
+containing these same `key=value` pairs. In both cases, we consider this
+to be a _bundle list_. The pairs specify information about the bundles
+that the client can use to make decisions for which bundles to download
+and which to ignore.
+
+A few keys focus on properties of the list itself.
+
+bundle.version::
+ (Required) This value provides a version number for the bundle
+ list. If a future Git change enables a feature that needs the Git
+ client to react to a new key in the bundle list file, then this version
+ will increment. The only current version number is 1, and if any other
+ value is specified then Git will fail to use this file.
+
+bundle.mode::
+ (Required) This value has one of two values: `all` and `any`. When `all`
+ is specified, then the client should expect to need all of the listed
+ bundle URIs that match their repository's requirements. When `any` is
+ specified, then the client should expect that any one of the bundle URIs
+ that match their repository's requirements will suffice. Typically, the
+ `any` option is used to list a number of different bundle servers
+ located in different geographies.
+
+bundle.heuristic::
+ If this string-valued key exists, then the bundle list is designed to
+ work well with incremental `git fetch` commands. The heuristic signals
+ that there are additional keys available for each bundle that help
+ determine which subset of bundles the client should download. The only
+ heuristic currently planned is `creationToken`.
+
+The remaining keys include an `<id>` segment which is a server-designated
+name for each available bundle. The `<id>` must contain only alphanumeric
+and `-` characters.
+
+bundle.<id>.uri::
+ (Required) This string value is the URI for downloading bundle `<id>`.
+ If the URI begins with a protocol (`http://` or `https://`) then the URI
+ is absolute. Otherwise, the URI is interpreted as relative to the URI
+ used for the bundle list. If the URI begins with `/`, then that relative
+ path is relative to the domain name used for the bundle list. (This use
+ of relative paths is intended to make it easier to distribute a set of
+ bundles across a large number of servers or CDNs with different domain
+ names.)
+
+bundle.<id>.filter::
+ This string value represents an object filter that should also appear in
+ the header of this bundle. The server uses this value to differentiate
+ different kinds of bundles from which the client can choose those that
+ match their object filters.
+
+bundle.<id>.creationToken::
+ This value is a nonnegative 64-bit integer used for sorting the bundles
+ list. This is used to download a subset of bundles during a fetch when
+ `bundle.heuristic=creationToken`.
+
+bundle.<id>.location::
+ This string value advertises a real-world location from where the bundle
+ URI is served. This can be used to present the user with an option for
+ which bundle URI to use or simply as an informative indicator of which
+ bundle URI was selected by Git. This is only valuable when
+ `bundle.mode` is `any`.
+
+Here is an example bundle list using the Git config format:
+
+ [bundle]
+ version = 1
+ mode = all
+ heuristic = creationToken
+
+ [bundle "2022-02-09-1644442601-daily"]
+ uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
+ creationToken = 1644442601
+
+ [bundle "2022-02-02-1643842562"]
+ uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
+ creationToken = 1643842562
+
+ [bundle "2022-02-09-1644442631-daily-blobless"]
+ uri = 2022-02-09-1644442631-daily-blobless.bundle
+ creationToken = 1644442631
+ filter = blob:none
+
+ [bundle "2022-02-02-1643842568-blobless"]
+ uri = /git/git/2022-02-02-1643842568-blobless.bundle
+ creationToken = 1643842568
+ filter = blob:none
+
+This example uses `bundle.mode=all` as well as the
+`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
+options to present two parallel sets of bundles: one for full clones and
+another for blobless partial clones.
+
+Suppose that this bundle list was found at the URI
+`https://bundles.example.com/git/git/` and so the two blobless bundles have
+the following fully-expanded URIs:
+
+* `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
+* `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle`
+
+Advertising Bundle URIs
+-----------------------
+
+If a user knows a bundle URI for the repository they are cloning, then
+they can specify that URI manually through a command-line option. However,
+a Git host may want to advertise bundle URIs during the clone operation,
+helping users unaware of the feature.
+
+The only thing required for this feature is that the server can advertise
+one or more bundle URIs. This advertisement takes the form of a new
+protocol v2 capability specifically for discovering bundle URIs.
+
+The client could choose an arbitrary bundle URI as an option _or_ select
+the URI with best performance by some exploratory checks. It is up to the
+bundle provider to decide if having multiple URIs is preferable to a
+single URI that is geodistributed through server-side infrastructure.
+
+Cloning with Bundle URIs
+------------------------
+
+The primary need for bundle URIs is to speed up clones. The Git client
+will interact with bundle URIs according to the following flow:
+
+1. The user specifies a bundle URI with the `--bundle-uri` command-line
+ option _or_ the client discovers a bundle list advertised by the
+ Git server.
+
+2. If the downloaded data from a bundle URI is a bundle, then the client
+ inspects the bundle headers to check that the prerequisite commit OIDs
+ are present in the client repository. If some are missing, then the
+ client delays unbundling until other bundles have been unbundled,
+ making those OIDs present. When all required OIDs are present, the
+ client unbundles that data using a refspec. The default refspec is
+ `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs
+ are stored so that later `git fetch` negotiations can communicate each
+ bundled ref as a `have`, reducing the size of the fetch over the Git
+ protocol. To allow pruning refs from this ref namespace, Git may
+ introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that
+ stale bundle refs can be deleted.
+
+3. If the file is instead a bundle list, then the client inspects the
+ `bundle.mode` to see if the list is of the `all` or `any` form.
+
+ a. If `bundle.mode=all`, then the client considers all bundle
+ URIs. The list is reduced based on the `bundle.<id>.filter` options
+ matching the client repository's partial clone filter. Then, all
+ bundle URIs are requested. If the `bundle.<id>.creationToken`
+ heuristic is provided, then the bundles are downloaded in decreasing
+ order by the creation token, stopping when a bundle has all required
+ OIDs. The bundles can then be unbundled in increasing creation token
+ order. The client stores the latest creation token as a heuristic
+ for avoiding future downloads if the bundle list does not advertise
+ bundles with larger creation tokens.
+
+ b. If `bundle.mode=any`, then the client can choose any one of the
+ bundle URIs to inspect. The client can use a variety of ways to
+ choose among these URIs. The client can also fallback to another URI
+ if the initial choice fails to return a result.
+
+Note that during a clone we expect that all bundles will be required, and
+heuristics such as `bundle.<uri>.creationToken` can be used to download
+bundles in chronological order or in parallel.
+
+If a given bundle URI is a bundle list with a `bundle.heuristic`
+value, then the client can choose to store that URI as its chosen bundle
+URI. The client can then navigate directly to that URI during later `git
+fetch` calls.
+
+When downloading bundle URIs, the client can choose to inspect the initial
+content before committing to downloading the entire content. This may
+provide enough information to determine if the URI is a bundle list or
+a bundle. In the case of a bundle, the client may inspect the bundle
+header to determine that all advertised tips are already in the client
+repository and cancel the remaining download.
+
+Fetching with Bundle URIs
+-------------------------
+
+When the client fetches new data, it can decide to fetch from bundle
+servers before fetching from the origin remote. This could be done via a
+command-line option, but it is more likely useful to use a config value
+such as the one specified during the clone.
+
+The fetch operation follows the same procedure to download bundles from a
+bundle list (although we do _not_ want to use parallel downloads here). We
+expect that the process will end when all prerequisite commit OIDs in a
+thin bundle are already in the object database.
+
+When using the `creationToken` heuristic, the client can avoid downloading
+any bundles if their creation tokens are not larger than the stored
+creation token. After fetching new bundles, Git updates this local
+creation token.
+
+If the bundle provider does not provide a heuristic, then the client
+should attempt to inspect the bundle headers before downloading the full
+bundle data in case the bundle tips already exist in the client
+repository.
+
+Error Conditions
+----------------
+
+If the Git client discovers something unexpected while downloading
+information according to a bundle URI or the bundle list found at that
+location, then Git can ignore that data and continue as if it was not
+given a bundle URI. The remote Git server is the ultimate source of truth,
+not the bundle URI.
+
+Here are a few example error conditions:
+
+* The client fails to connect with a server at the given URI or a connection
+ is lost without any chance to recover.
+
+* The client receives a 400-level response (such as `404 Not Found` or
+ `401 Not Authorized`). The client should use the credential helper to
+ find and provide a credential for the URI, but match the semantics of
+ Git's other HTTP protocols in terms of handling specific 400-level
+ errors.
+
+* The server reports any other failure response.
+
+* The client receives data that is not parsable as a bundle or bundle list.
+
+* A bundle includes a filter that does not match expectations.
+
+* The client cannot unbundle the bundles because the prerequisite commit OIDs
+ are not in the object database and there are no more bundles to download.
+
+There are also situations that could be seen as wasteful, but are not
+error conditions:
+
+* The downloaded bundles contain more information than is requested by
+ the clone or fetch request. A primary example is if the user requests
+ a clone with `--single-branch` but downloads bundles that store every
+ reachable commit from all `refs/heads/*` references. This might be
+ initially wasteful, but perhaps these objects will become reachable by
+ a later ref update that the client cares about.
+
+* A bundle download during a `git fetch` contains objects already in the
+ object database. This is probably unavoidable if we are using bundles
+ for fetches, since the client will almost always be slightly ahead of
+ the bundle servers after performing its "catch-up" fetch to the remote
+ server. This extra work is most wasteful when the client is fetching
+ much more frequently than the server is computing bundles, such as if
+ the client is using hourly prefetches with background maintenance, but
+ the server is computing bundles weekly. For this reason, the client
+ should not use bundle URIs for fetch unless the server has explicitly
+ recommended it through a `bundle.heuristic` value.
+
+Example Bundle Provider organization
+------------------------------------
+
+The bundle URI feature is intentionally designed to be flexible to
+different ways a bundle provider wants to organize the object data.
+However, it can be helpful to have a complete organization model described
+here so providers can start from that base.
+
+This example organization is a simplified model of what is used by the
+GVFS Cache Servers (see section near the end of this document) which have
+been beneficial in speeding up clones and fetches for very large
+repositories, although using extra software outside of Git.
+
+The bundle provider deploys servers across multiple geographies. Each
+server manages its own bundle set. The server can track a number of Git
+repositories, but provides a bundle list for each based on a pattern. For
+example, when mirroring a repository at `https://<domain>/<org>/<repo>`
+the bundle server could have its bundle list available at
+`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
+list all of these servers under the "any" mode:
+
+ [bundle]
+ version = 1
+ mode = any
+
+ [bundle "eastus"]
+ uri = https://eastus.example.com/<domain>/<org>/<repo>
+
+ [bundle "europe"]
+ uri = https://europe.example.com/<domain>/<org>/<repo>
+
+ [bundle "apac"]
+ uri = https://apac.example.com/<domain>/<org>/<repo>
+
+This "list of lists" is static and only changes if a bundle server is
+added or removed.
+
+Each bundle server manages its own set of bundles. The initial bundle list
+contains only a single bundle, containing all of the objects received from
+cloning the repository from the origin server. The list uses the
+`creationToken` heuristic and a `creationToken` is made for the bundle
+based on the server's timestamp.
+
+The bundle server runs regularly-scheduled updates for the bundle list,
+such as once a day. During this task, the server fetches the latest
+contents from the origin server and generates a bundle containing the
+objects reachable from the latest origin refs, but not contained in a
+previously-computed bundle. This bundle is added to the list, with care
+that the `creationToken` is strictly greater than the previous maximum
+`creationToken`.
+
+When the bundle list grows too large, say more than 30 bundles, then the
+oldest "_N_ minus 30" bundles are combined into a single bundle. This
+bundle's `creationToken` is equal to the maximum `creationToken` among the
+merged bundles.
+
+An example bundle list is provided here, although it only has two daily
+bundles and not a full list of 30:
+
+ [bundle]
+ version = 1
+ mode = all
+ heuristic = creationToken
+
+ [bundle "2022-02-13-1644770820-daily"]
+ uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644770820-daily.bundle
+ creationToken = 1644770820
+
+ [bundle "2022-02-09-1644442601-daily"]
+ uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle
+ creationToken = 1644442601
+
+ [bundle "2022-02-02-1643842562"]
+ uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle
+ creationToken = 1643842562
+
+To avoid storing and serving object data in perpetuity despite becoming
+unreachable in the origin server, this bundle merge can be more careful.
+Instead of taking an absolute union of the old bundles, instead the bundle
+can be created by looking at the newer bundles and ensuring that their
+necessary commits are all available in this merged bundle (or in another
+one of the newer bundles). This allows "expiring" object data that is not
+being used by new commits in this window of time. That data could be
+reintroduced by a later push.
+
+The intention of this data organization has two main goals. First, initial
+clones of the repository become faster by downloading precomputed object
+data from a closer source. Second, `git fetch` commands can be faster,
+especially if the client has not fetched for a few days. However, if a
+client does not fetch for 30 days, then the bundle list organization would
+cause redownloading a large amount of object data.
+
+One way to make this organization more useful to users who fetch frequently
+is to have more frequent bundle creation. For example, bundles could be
+created every hour, and then once a day those "hourly" bundles could be
+merged into a "daily" bundle. The daily bundles are merged into the
+oldest bundle after 30 days.
+
+It is recommended that this bundle strategy is repeated with the `blob:none`
+filter if clients of this repository are expecting to use blobless partial
+clones. This list of blobless bundles stays in the same list as the full
+bundles, but uses the `bundle.<id>.filter` key to separate the two groups.
+For very large repositories, the bundle provider may want to _only_ provide
+blobless bundles.
+
+Implementation Plan
+-------------------
+
+This design document is being submitted on its own as an aspirational
+document, with the goal of implementing all of the mentioned client
+features over the course of several patch series. Here is a potential
+outline for submitting these features:
+
+1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
+ This will include a new `git fetch --bundle-uri` mode for use as the
+ implementation underneath `git clone`. The initial version here will
+ expect a single bundle at the given URI.
+
+2. Implement the ability to parse a bundle list from a bundle URI and
+ update the `git fetch --bundle-uri` logic to properly distinguish
+ between `bundle.mode` options. Specifically design the feature so
+ that the config format parsing feeds a list of key-value pairs into the
+ bundle list logic.
+
+3. Create the `bundle-uri` protocol v2 command so Git servers can advertise
+ bundle URIs using the key-value pairs. Plug into the existing key-value
+ input to the bundle list logic. Allow `git clone` to discover these
+ bundle URIs and bootstrap the client repository from the bundle data.
+ (This choice is an opt-in via a config option and a command-line
+ option.)
+
+4. Allow the client to understand the `bundle.heuristic` configuration key
+ and the `bundle.<id>.creationToken` heuristic. When `git clone`
+ discovers a bundle URI with `bundle.heuristic`, it configures the client
+ repository to check that bundle URI during later `git fetch <remote>`
+ commands.
+
+5. Allow clients to discover bundle URIs during `git fetch` and configure
+ a bundle URI for later fetches if `bundle.heuristic` is set.
+
+6. Implement the "inspect headers" heuristic to reduce data downloads when
+ the `bundle.<id>.creationToken` heuristic is not available.
+
+As these features are reviewed, this plan might be updated. We also expect
+that new designs will be discovered and implemented as this feature
+matures and becomes used in real-world scenarios.
+
+Related Work: Packfile URIs
+---------------------------
+
+The Git protocol already has a capability where the Git server can list
+a set of URLs along with the packfile response when serving a client
+request. The client is then expected to download the packfiles at those
+locations in order to have a complete understanding of the response.
+
+This mechanism is used by the Gerrit server (implemented with JGit) and
+has been effective at reducing CPU load and improving user performance for
+clones.
+
+A major downside to this mechanism is that the origin server needs to know
+_exactly_ what is in those packfiles, and the packfiles need to be available
+to the user for some time after the server has responded. This coupling
+between the origin and the packfile data is difficult to manage.
+
+Further, this implementation is extremely hard to make work with fetches.
+
+Related Work: GVFS Cache Servers
+--------------------------------
+
+The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
+the Git project before Git's partial clone was created. One feature of this
+protocol is the idea of a "cache server" which can be colocated with build
+machines or developer offices to transfer Git data without overloading the
+central server.
+
+The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
+endpoint, which allows downloading an object on-demand. This is a critical
+piece of the filesystem virtualization of that product.
+
+However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
+endpoint. Given an optional timestamp, the cache server responds with a list
+of precomputed packfiles containing the commits and trees that were introduced
+in those time intervals.
+
+The cache server computes these "prefetch" packfiles using the following
+strategy:
+
+1. Every hour, an "hourly" pack is generated with a given timestamp.
+2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
+3. Nightly, all prefetch packs more than 30 days old are rolled up into
+ one pack.
+
+When a user runs `gvfs clone` or `scalar clone` against a repo with cache
+servers, the client requests all prefetch packfiles, which is at most
+`24 + 30 + 1` packfiles downloading only commits and trees. The client
+then follows with a request to the origin server for the references, and
+attempts to checkout that tip reference. (There is an extra endpoint that
+helps get all reachable trees from a given commit, in case that commit
+was not already in a prefetch packfile.)
+
+During a `git fetch`, a hook requests the prefetch endpoint using the
+most-recent timestamp from a previously-downloaded prefetch packfile.
+Only the list of packfiles with later timestamps are downloaded. Most
+users fetch hourly, so they get at most one hourly prefetch pack. Users
+whose machines have been off or otherwise have not fetched in over 30 days
+might redownload all prefetch packfiles. This is rare.
+
+It is important to note that the clients always contact the origin server
+for the refs advertisement, so the refs are frequently "ahead" of the
+prefetched pack data. The missing objects are downloaded on-demand using
+the `GET gvfs/objects/{oid}` requests, when needed by a command such as
+`git checkout` or `git log`. Some Git optimizations disable checks that
+would cause these on-demand downloads to be too aggressive.
+
+See Also
+--------
+
+[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
+ An earlier RFC for a bundle URI feature.
+
+[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
+ The GVFS Protocol
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
deleted file mode 100644
index a4f1744..0000000
--- a/Documentation/technical/commit-graph-format.txt
+++ /dev/null
@@ -1,104 +0,0 @@
-Git commit graph format
-=======================
-
-The Git commit graph stores a list of commit OIDs and some associated
-metadata, including:
-
-- The generation number of the commit. Commits with no parents have
- generation number 1; commits with parents have generation number
- one more than the maximum generation number of its parents. We
- reserve zero as special, and can be used to mark a generation
- number invalid or as "not computed".
-
-- The root tree OID.
-
-- The commit date.
-
-- The parents of the commit, stored using positional references within
- the graph file.
-
-These positional references are stored as unsigned 32-bit integers
-corresponding to the array position within the list of commit OIDs. Due
-to some special constants we use to track parents, we can store at most
-(1 << 30) + (1 << 29) + (1 << 28) - 1 (around 1.8 billion) commits.
-
-== Commit graph files have the following format:
-
-In order to allow extensions that add extra data to the graph, we organize
-the body into "chunks" and provide a binary lookup table at the beginning
-of the body. The header includes certain values, such as number of chunks
-and hash type.
-
-All 4-byte numbers are in network order.
-
-HEADER:
-
- 4-byte signature:
- The signature is: {'C', 'G', 'P', 'H'}
-
- 1-byte version number:
- Currently, the only valid version is 1.
-
- 1-byte Hash Version (1 = SHA-1)
- We infer the hash length (H) from this value.
-
- 1-byte number (C) of "chunks"
-
- 1-byte number (B) of base commit-graphs
- We infer the length (H*B) of the Base Graphs chunk
- from this value.
-
-CHUNK LOOKUP:
-
- (C + 1) * 12 bytes listing the table of contents for the chunks:
- First 4 bytes describe the chunk id. Value 0 is a terminating label.
- Other 8 bytes provide the byte-offset in current file for chunk to
- start. (Chunks are ordered contiguously in the file, so you can infer
- the length using the next chunk position if necessary.) Each chunk
- ID appears at most once.
-
- The remaining data in the body is described one chunk at a time, and
- these chunks may be given in any order. Chunks are required unless
- otherwise specified.
-
-CHUNK DATA:
-
- OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
- The ith entry, F[i], stores the number of OIDs with first
- byte at most i. Thus F[255] stores the total
- number of commits (N).
-
- OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
- The OIDs for all commits in the graph, sorted in ascending order.
-
- Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes)
- * The first H bytes are for the OID of the root tree.
- * The next 8 bytes are for the positions of the first two parents
- of the ith commit. Stores value 0x7000000 if no parent in that
- position. If there are more than two parents, the second value
- has its most-significant bit on and the other bits store an array
- position into the Extra Edge List chunk.
- * The next 8 bytes store the generation number of the commit and
- the commit time in seconds since EPOCH. The generation number
- uses the higher 30 bits of the first 4 bytes, while the commit
- time uses the 32 bits of the second 4 bytes, along with the lowest
- 2 bits of the lowest byte, storing the 33rd and 34th bit of the
- commit time.
-
- Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
- This list of 4-byte values store the second through nth parents for
- all octopus merges. The second parent value in the commit data stores
- an array position within this list along with the most-significant bit
- on. Starting at that array position, iterate through this list of commit
- positions for the parents until reaching a value with the most-significant
- bit on. The other bits correspond to the position of the last parent.
-
- Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
- This list of H-byte hashes describe a set of B commit-graph files that
- form a commit-graph chain. The graph position for the ith commit in this
- file's OID Lookup chunk is equal to i plus the number of commits in all
- base graphs. If B is non-zero, this chunk must exist.
-
-TRAILER:
-
- H-byte HASH-checksum of all of the above.
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index 729fbcb..2c26e95 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -1,4 +1,4 @@
-Git Commit Graph Design Notes
+Git Commit-Graph Design Notes
=============================
Git walks the commit graph for many reasons, including:
@@ -17,16 +17,16 @@ There are two main costs here:
The commit-graph file is a supplemental data structure that accelerates
commit graph walks. If a user downgrades or disables the 'core.commitGraph'
-config setting, then the existing ODB is sufficient. The file is stored
+config setting, then the existing object database is sufficient. The file is stored
as "commit-graph" either in the .git/objects/info directory or in the info
directory of an alternate.
The commit-graph file stores the commit graph structure along with some
-extra metadata to speed up graph walks. By listing commit OIDs in lexi-
-cographic order, we can identify an integer position for each commit and
-refer to the parents of a commit using those integer positions. We use
-binary search to find initial commits and then use the integer positions
-for fast lookups during the walk.
+extra metadata to speed up graph walks. By listing commit OIDs in
+lexicographic order, we can identify an integer position for each commit
+and refer to the parents of a commit using those integer positions. We
+use binary search to find initial commits and then use the integer
+positions for fast lookups during the walk.
A consumer may load the following info for a commit from the graph:
@@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
Values 1-4 satisfy the requirements of parse_commit_gently().
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation number v1)
- * A commit with no parents (a root commit) has generation number one.
+Define "corrected committer date" of a commit recursively as follows:
- * A commit with at least one parent has generation number one more than
- the largest generation number among its parents.
+ * A commit with no parents (a root commit) has corrected committer date
+ equal to its committer date.
-Equivalently, the generation number of a commit A is one more than the
+ * A commit with at least one parent has corrected committer date equal to
+ the maximum of its committer date and one more than the largest corrected
+ committer date among its parents.
+
+ * As a special case, a root commit with timestamp zero has corrected commit
+ date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+ (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+ the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
length of a longest path from A to a root commit. The recursive definition
is easier to use for computation and observing the following property:
@@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
generation numbers, then we always expand the boundary commit with highest
generation number and can easily detect the stopping condition.
+The property applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
This property can be used to significantly reduce the time it takes to
walk commits and determine topological relationships. Without generation
numbers, the general heuristic is the following:
@@ -67,17 +87,19 @@ numbers, the general heuristic is the following:
If A and B are commits with commit time X and Y, respectively, and
X < Y, then A _probably_ cannot reach B.
-This heuristic is currently used whenever the computation is allowed to
+In absence of corrected commit dates (for example, old versions of Git or
+mixed generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
violate topological relationships due to clock skew (such as "git log"
with default order), but is not used when the topological order is
required (such as merge base calculations, "git log --graph").
In practice, we expect some commits to be created recently and not stored
-in the commit graph. We can treat these commits as having "infinite"
+in the commit-graph. We can treat these commits as having "infinite"
generation number and walk until reaching commits with known generation
number.
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
in the commit-graph file. If a commit-graph file was written by a version
of Git that did not compute generation numbers, then those commits will
have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -85,7 +107,7 @@ have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
Since the commit-graph file is closed under reachability, we can guarantee
the following weaker condition on all commits:
- If A and B are commits with generation numbers N amd M, respectively,
+ If A and B are commits with generation numbers N and M, respectively,
and N < M, then A cannot reach B.
Note how the strict inequality differs from the inequality when we have
@@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
walking a few extra commits, but the simplicity in dealing with commits
with generation number *_INFINITY or *_ZERO is valuable.
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
+topological levels (generation number v1) are computed to be at least
+this value. We limit at this value since it is the largest value that
+can be stored in the commit-graph file using the 30 bits available
+to topological levels. This presents another case where a commit can
+have generation number equal to that of a parent.
Design Details
--------------
@@ -114,7 +136,7 @@ Design Details
- Commit grafts and replace objects can change the shape of the commit
history. The latter can also be enabled/disabled on the fly using
- `--no-replace-objects`. This leads to difficultly storing both possible
+ `--no-replace-objects`. This leads to difficulty storing both possible
interpretations of a commit id, especially when computing generation
numbers. The commit-graph will not be read or written when
replace-objects or grafts are present.
@@ -127,7 +149,7 @@ Design Details
helpful for these clones, anyway. The commit-graph will not be read or
written when shallow commits are present.
-Commit Graphs Chains
+Commit-Graphs Chains
--------------------
Typically, repos grow with near-constant velocity (commits per day). Over time,
@@ -210,12 +232,12 @@ file.
+---------------------+
| |
+-----------------------+ +---------------------+
- | graph-{hash2} |->| |
+ | graph-{hash2} |->| |
+-----------------------+ +---------------------+
| | |
+-----------------------+ +---------------------+
| | | |
- | graph-{hash1} |->| |
+ | graph-{hash1} |->| |
| | | |
+-----------------------+ +---------------------+
| tmp_graphXXX
@@ -223,7 +245,7 @@ file.
| |
| |
| |
- | graph-{hash0} |
+ | graph-{hash0} |
| |
| |
| |
@@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
number of commits) could be extracted into config settings for full
flexibility.
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects the topmost
+layer to see if the layer is missing corrected commit dates. In such a case
+Git only uses topological level for generation numbers.
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if the
+layer below the new layer has corrected commit dates.
+
+While writing or merging layers, if the new layer is the only layer, it will
+have corrected commit dates when written by compatible versions of Git. Thus,
+rewriting split commit-graph as a single file (`--split=replace`) creates a
+single layer with corrected commit dates.
+
## Deleting graph-{hash} files
After a new tip file is written, some `graph-{hash}` files may no longer
@@ -323,14 +374,14 @@ Related Links
[0] https://bugs.chromium.org/p/git/issues/detail?id=8
Chromium work item for: Serialized Commit Graph
-[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
+[1] https://lore.kernel.org/git/20110713070517.GC18566@sigill.intra.peff.net/
An abandoned patch that introduced generation numbers.
-[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
+[2] https://lore.kernel.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
Discussion about generation numbers on commits and how they interact
with fsck.
-[3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/
+[3] https://lore.kernel.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/
More discussion about generation numbers and not storing them inside
commit objects. A valuable quote:
@@ -342,9 +393,9 @@ Related Links
commit objects (i.e., packv4 or something like the "metapacks" I
proposed a few years ago)."
-[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
+[4] https://lore.kernel.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
A patch to remove the ahead-behind calculation from 'status'.
-[5] https://public-inbox.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/
+[5] https://lore.kernel.org/git/f27db281-abad-5043-6d71-cbb083b1c877@gmail.com/
A discussion of a "two-dimensional graph position" that can allow reading
multiple commit-graph chains at the same time.
diff --git a/Documentation/technical/directory-rename-detection.txt b/Documentation/technical/directory-rename-detection.txt
index 844629c..029ee2c 100644
--- a/Documentation/technical/directory-rename-detection.txt
+++ b/Documentation/technical/directory-rename-detection.txt
@@ -2,9 +2,9 @@ Directory rename detection
==========================
Rename detection logic in diffcore-rename that checks for renames of
-individual files is aggregated and analyzed in merge-recursive for cases
-where combinations of renames indicate that a full directory has been
-renamed.
+individual files is also aggregated there and then analyzed in either
+merge-ort or merge-recursive for cases where combinations of renames
+indicate that a full directory has been renamed.
Scope of abilities
------------------
@@ -18,7 +18,8 @@ It is perhaps easiest to start with an example:
More interesting possibilities exist, though, such as:
* one side of history renames x -> z, and the other renames some file to
- x/e, causing the need for the merge to do a transitive rename.
+ x/e, causing the need for the merge to do a transitive rename so that
+ the rename ends up at z/e.
* one side of history renames x -> z, but also renames all files within x.
For example, x/a -> z/alpha, x/b -> z/bravo, etc.
@@ -35,7 +36,7 @@ More interesting possibilities exist, though, such as:
directory itself contained inner directories that were renamed to yet
other locations).
- * combinations of the above; see t/t6043-merge-rename-directories.sh for
+ * combinations of the above; see t/t6423-merge-rename-directories.sh for
various interesting cases.
Limitations -- applicability of directory renames
@@ -62,19 +63,19 @@ directory rename detection applies:
Limitations -- detailed rules and testcases
-------------------------------------------
-t/t6043-merge-rename-directories.sh contains extensive tests and commentary
+t/t6423-merge-rename-directories.sh contains extensive tests and commentary
which generate and explore the rules listed above. It also lists a few
additional rules:
a) If renames split a directory into two or more others, the directory
with the most renames, "wins".
- b) Avoid directory-rename-detection for a path, if that path is the
- source of a rename on either side of a merge.
-
- c) Only apply implicit directory renames to directories if the other side
+ b) Only apply implicit directory renames to directories if the other side
of history is the one doing the renaming.
+ c) Do not perform directory rename detection for directories which had no
+ new paths added to them.
+
Limitations -- support in different commands
--------------------------------------------
@@ -87,9 +88,11 @@ directory rename detection support in:
Folks have requested in the past that `git diff` detect directory
renames and somehow simplify its output. It is not clear whether this
would be desirable or how the output should be simplified, so this was
- simply not implemented. Further, to implement this, directory rename
- detection logic would need to move from merge-recursive to
- diffcore-rename.
+ simply not implemented. Also, while diffcore-rename has most of the
+ logic for detecting directory renames, some of the logic is still found
+ within merge-ort and merge-recursive. Fully supporting directory
+ rename detection in diffs would require copying or moving the remaining
+ bits of logic to the diff machinery.
* am
diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
index 2ae8fa4..ed57481 100644
--- a/Documentation/technical/hash-function-transition.txt
+++ b/Documentation/technical/hash-function-transition.txt
@@ -33,16 +33,9 @@ researchers. On 23 February 2017 the SHAttered attack
Git v2.13.0 and later subsequently moved to a hardened SHA-1
implementation by default, which isn't vulnerable to the SHAttered
-attack.
+attack, but SHA-1 is still weak.
-Thus Git has in effect already migrated to a new hash that isn't SHA-1
-and doesn't share its vulnerabilities, its new hash function just
-happens to produce exactly the same output for all known inputs,
-except two PDFs published by the SHAttered researchers, and the new
-implementation (written by those researchers) claims to detect future
-cryptanalytic collision attacks.
-
-Regardless, it's considered prudent to move past any variant of SHA-1
+Thus it's considered prudent to move past any variant of SHA-1
to a new hash. There's no guarantee that future attacks on SHA-1 won't
be published in the future, and those attacks may not have viable
mitigations.
@@ -57,6 +50,38 @@ SHA-1 still possesses the other properties such as fast object lookup
and safe error checking, but other hash functions are equally suitable
that are believed to be cryptographically secure.
+Choice of Hash
+--------------
+The hash to replace the hardened SHA-1 should be stronger than SHA-1
+was: we would like it to be trustworthy and useful in practice for at
+least 10 years.
+
+Some other relevant properties:
+
+1. A 256-bit hash (long enough to match common security practice; not
+ excessively long to hurt performance and disk usage).
+
+2. High quality implementations should be widely available (e.g., in
+ OpenSSL and Apple CommonCrypto).
+
+3. The hash function's properties should match Git's needs (e.g. Git
+ requires collision and 2nd preimage resistance and does not require
+ length extension resistance).
+
+4. As a tiebreaker, the hash should be fast to compute (fortunately
+ many contenders are faster than SHA-1).
+
+There were several contenders for a successor hash to SHA-1, including
+SHA-256, SHA-512/256, SHA-256x16, K12, and BLAKE2bp-256.
+
+In late 2018 the project picked SHA-256 as its successor hash.
+
+See 0ed8d8da374 (doc hash-function-transition: pick SHA-256 as
+NewHash, 2018-08-04) and numerous mailing list threads at the time,
+particularly the one starting at
+https://lore.kernel.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/
+for more information.
+
Goals
-----
1. The transition to SHA-256 can be done one local repository at a time.
@@ -94,7 +119,7 @@ Overview
--------
We introduce a new repository format extension. Repositories with this
extension enabled use SHA-256 instead of SHA-1 to name their objects.
-This affects both object names and object content --- both the names
+This affects both object names and object content -- both the names
of objects and all references to other objects within an object are
switched to the new hash function.
@@ -107,7 +132,7 @@ mapping to allow naming objects using either their SHA-1 and SHA-256 names
interchangeably.
"git cat-file" and "git hash-object" gain options to display an object
-in its sha1 form and write an object given its sha1 form. This
+in its SHA-1 form and write an object given its SHA-1 form. This
requires all objects referenced by that object to be present in the
object database so that they can be named using the appropriate name
(using the bidirectional hash mapping).
@@ -115,7 +140,7 @@ object database so that they can be named using the appropriate name
Fetches from a SHA-1 based server convert the fetched objects into
SHA-256 form and record the mapping in the bidirectional mapping table
(see below for details). Pushes to a SHA-1 based server convert the
-objects being pushed into sha1 form so the server does not have to be
+objects being pushed into SHA-1 form so the server does not have to be
aware of the hash function the client is using.
Detailed Design
@@ -151,38 +176,38 @@ repository extensions.
Object names
~~~~~~~~~~~~
-Objects can be named by their 40 hexadecimal digit sha1-name or 64
-hexadecimal digit sha256-name, plus names derived from those (see
+Objects can be named by their 40 hexadecimal digit SHA-1 name or 64
+hexadecimal digit SHA-256 name, plus names derived from those (see
gitrevisions(7)).
-The sha1-name of an object is the SHA-1 of the concatenation of its
-type, length, a nul byte, and the object's sha1-content. This is the
+The SHA-1 name of an object is the SHA-1 of the concatenation of its
+type, length, a nul byte, and the object's SHA-1 content. This is the
traditional <sha1> used in Git to name objects.
-The sha256-name of an object is the SHA-256 of the concatenation of its
-type, length, a nul byte, and the object's sha256-content.
+The SHA-256 name of an object is the SHA-256 of the concatenation of its
+type, length, a nul byte, and the object's SHA-256 content.
Object format
~~~~~~~~~~~~~
The content as a byte sequence of a tag, commit, or tree object named
-by sha1 and sha256 differ because an object named by sha256-name refers to
-other objects by their sha256-names and an object named by sha1-name
-refers to other objects by their sha1-names.
+by SHA-1 and SHA-256 differ because an object named by SHA-256 name refers to
+other objects by their SHA-256 names and an object named by SHA-1 name
+refers to other objects by their SHA-1 names.
-The sha256-content of an object is the same as its sha1-content, except
-that objects referenced by the object are named using their sha256-names
-instead of sha1-names. Because a blob object does not refer to any
-other object, its sha1-content and sha256-content are the same.
+The SHA-256 content of an object is the same as its SHA-1 content, except
+that objects referenced by the object are named using their SHA-256 names
+instead of SHA-1 names. Because a blob object does not refer to any
+other object, its SHA-1 content and SHA-256 content are the same.
-The format allows round-trip conversion between sha256-content and
-sha1-content.
+The format allows round-trip conversion between SHA-256 content and
+SHA-1 content.
Object storage
~~~~~~~~~~~~~~
Loose objects use zlib compression and packed objects use the packed
-format described in Documentation/technical/pack-format.txt, just like
-today. The content that is compressed and stored uses sha256-content
-instead of sha1-content.
+format described in linkgit:gitformat-pack[5], just like
+today. The content that is compressed and stored uses SHA-256 content
+instead of SHA-1 content.
Pack index
~~~~~~~~~~
@@ -191,21 +216,21 @@ hash functions. They have the following format (all integers are in
network byte order):
- A header appears at the beginning and consists of the following:
- - The 4-byte pack index signature: '\377t0c'
- - 4-byte version number: 3
- - 4-byte length of the header section, including the signature and
+ * The 4-byte pack index signature: '\377t0c'
+ * 4-byte version number: 3
+ * 4-byte length of the header section, including the signature and
version number
- - 4-byte number of objects contained in the pack
- - 4-byte number of object formats in this pack index: 2
- - For each object format:
- - 4-byte format identifier (e.g., 'sha1' for SHA-1)
- - 4-byte length in bytes of shortened object names. This is the
+ * 4-byte number of objects contained in the pack
+ * 4-byte number of object formats in this pack index: 2
+ * For each object format:
+ ** 4-byte format identifier (e.g., 'sha1' for SHA-1)
+ ** 4-byte length in bytes of shortened object names. This is the
shortest possible length needed to make names in the shortened
object name table unambiguous.
- - 4-byte integer, recording where tables relating to this format
+ ** 4-byte integer, recording where tables relating to this format
are stored in this index file, as an offset from the beginning.
- - 4-byte offset to the trailer from the beginning of this file.
- - Zero or more additional key/value pairs (4-byte key, 4-byte
+ * 4-byte offset to the trailer from the beginning of this file.
+ * Zero or more additional key/value pairs (4-byte key, 4-byte
value). Only one key is supported: 'PSRC'. See the "Loose objects
and unreachable objects" section for supported values and how this
is used. All other keys are reserved. Readers must ignore
@@ -213,37 +238,36 @@ network byte order):
- Zero or more NUL bytes. This can optionally be used to improve the
alignment of the full object name table below.
- Tables for the first object format:
- - A sorted table of shortened object names. These are prefixes of
+ * A sorted table of shortened object names. These are prefixes of
the names of all objects in this pack file, packed together
without offset values to reduce the cache footprint of the binary
search for a specific object name.
- - A table of full object names in pack order. This allows resolving
+ * A table of full object names in pack order. This allows resolving
a reference to "the nth object in the pack file" (from a
reachability bitmap or from the next table of another object
format) to its object name.
- - A table of 4-byte values mapping object name order to pack order.
+ * A table of 4-byte values mapping object name order to pack order.
For an object in the table of sorted shortened object names, the
value at the corresponding index in this table is the index in the
previous table for that same object.
-
This can be used to look up the object in reachability bitmaps or
to look up its name in another object format.
- - A table of 4-byte CRC32 values of the packed object data, in the
+ * A table of 4-byte CRC32 values of the packed object data, in the
order that the objects appear in the pack file. This is to allow
compressed data to be copied directly from pack to pack during
repacking without undetected data corruption.
- - A table of 4-byte offset values. For an object in the table of
+ * A table of 4-byte offset values. For an object in the table of
sorted shortened object names, the value at the corresponding
index in this table indicates where that object can be found in
the pack file. These are usually 31-bit pack file offsets, but
large offsets are encoded as an index into the next table with the
most significant bit set.
- - A table of 8-byte offset entries (empty for pack files less than
+ * A table of 8-byte offset entries (empty for pack files less than
2 GiB). Pack files are organized with heavily used objects toward
the front, so most object references should not need to refer to
this table.
@@ -252,10 +276,10 @@ network byte order):
up to and not including the table of CRC32 values.
- Zero or more NUL bytes.
- The trailer consists of the following:
- - A copy of the 20-byte SHA-256 checksum at the end of the
+ * A copy of the 20-byte SHA-256 checksum at the end of the
corresponding packfile.
- - 20-byte SHA-256 checksum of all of the above.
+ * 20-byte SHA-256 checksum of all of the above.
Loose object index
~~~~~~~~~~~~~~~~~~
@@ -288,18 +312,18 @@ To remove entries (e.g. in "git pack-refs" or "git-prune"):
Translation table
~~~~~~~~~~~~~~~~~
-The index files support a bidirectional mapping between sha1-names
-and sha256-names. The lookup proceeds similarly to ordinary object
-lookups. For example, to convert a sha1-name to a sha256-name:
+The index files support a bidirectional mapping between SHA-1 names
+and SHA-256 names. The lookup proceeds similarly to ordinary object
+lookups. For example, to convert a SHA-1 name to a SHA-256 name:
1. Look for the object in idx files. If a match is present in the
- idx's sorted list of truncated sha1-names, then:
- a. Read the corresponding entry in the sha1-name order to pack
+ idx's sorted list of truncated SHA-1 names, then:
+ a. Read the corresponding entry in the SHA-1 name order to pack
name order mapping.
- b. Read the corresponding entry in the full sha1-name table to
+ b. Read the corresponding entry in the full SHA-1 name table to
verify we found the right object. If it is, then
- c. Read the corresponding entry in the full sha256-name table.
- That is the object's sha256-name.
+ c. Read the corresponding entry in the full SHA-256 name table.
+ That is the object's SHA-256 name.
2. Check for a loose object. Read lines from loose-object-idx until
we find a match.
@@ -313,10 +337,10 @@ Since all operations that make new objects (e.g., "git commit") add
the new objects to the corresponding index, this mapping is possible
for all objects in the object store.
-Reading an object's sha1-content
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The sha1-content of an object can be read by converting all sha256-names
-its sha256-content references to sha1-names using the translation table.
+Reading an object's SHA-1 content
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The SHA-1 content of an object can be read by converting all SHA-256 names
+of its SHA-256 content references to SHA-1 names using the translation table.
Fetch
~~~~~
@@ -339,7 +363,7 @@ the following steps:
1. index-pack: inflate each object in the packfile and compute its
SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
objects the client has locally. These objects can be looked up
- using the translation table and their sha1-content read as
+ using the translation table and their SHA-1 content read as
described above to resolve the deltas.
2. topological sort: starting at the "want"s from the negotiation
phase, walk through objects in the pack and emit a list of them,
@@ -348,12 +372,12 @@ the following steps:
(This list only contains objects reachable from the "wants". If the
pack from the server contained additional extraneous objects, then
they will be discarded.)
-3. convert to sha256: open a new (sha256) packfile. Read the topologically
+3. convert to SHA-256: open a new SHA-256 packfile. Read the topologically
sorted list just generated. For each object, inflate its
- sha1-content, convert to sha256-content, and write it to the sha256
- pack. Record the new sha1<->sha256 mapping entry for use in the idx.
+ SHA-1 content, convert to SHA-256 content, and write it to the SHA-256
+ pack. Record the new SHA-1<-->SHA-256 mapping entry for use in the idx.
4. sort: reorder entries in the new pack to match the order of objects
- in the pack the server generated and include blobs. Write a sha256 idx
+ in the pack the server generated and include blobs. Write a SHA-256 idx
file
5. clean up: remove the SHA-1 based pack file, index, and
topologically sorted list obtained from the server in steps 1
@@ -378,19 +402,20 @@ experimenting to get this to perform well.
Push
~~~~
Push is simpler than fetch because the objects referenced by the
-pushed objects are already in the translation table. The sha1-content
+pushed objects are already in the translation table. The SHA-1 content
of each object being pushed can be read as described in the "Reading
-an object's sha1-content" section to generate the pack written by git
+an object's SHA-1 content" section to generate the pack written by git
send-pack.
Signed Commits
~~~~~~~~~~~~~~
We add a new field "gpgsig-sha256" to the commit object format to allow
signing commits without relying on SHA-1. It is similar to the
-existing "gpgsig" field. Its signed payload is the sha256-content of the
+existing "gpgsig" field. Its signed payload is the SHA-256 content of the
commit object with any "gpgsig" and "gpgsig-sha256" fields removed.
This means commits can be signed
+
1. using SHA-1 only, as in existing signed commit objects
2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig
fields.
@@ -404,10 +429,11 @@ Signed Tags
~~~~~~~~~~~
We add a new field "gpgsig-sha256" to the tag object format to allow
signing tags without relying on SHA-1. Its signed payload is the
-sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP
+SHA-256 content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP
SIGNATURE-----" delimited in-body signature removed.
This means tags can be signed
+
1. using SHA-1 only, as in existing signed tag objects
2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body
signature.
@@ -415,11 +441,11 @@ This means tags can be signed
Mergetag embedding
~~~~~~~~~~~~~~~~~~
-The mergetag field in the sha1-content of a commit contains the
-sha1-content of a tag that was merged by that commit.
+The mergetag field in the SHA-1 content of a commit contains the
+SHA-1 content of a tag that was merged by that commit.
-The mergetag field in the sha256-content of the same commit contains the
-sha256-content of the same tag.
+The mergetag field in the SHA-256 content of the same commit contains the
+SHA-256 content of the same tag.
Submodules
~~~~~~~~~~
@@ -494,7 +520,7 @@ Caveats
-------
Invalid objects
~~~~~~~~~~~~~~~
-The conversion from sha1-content to sha256-content retains any
+The conversion from SHA-1 content to SHA-256 content retains any
brokenness in the original object (e.g., tree entry modes encoded with
leading 0, tree objects whose paths are not sorted correctly, and
commit objects without an author or committer). This is a deliberate
@@ -513,15 +539,15 @@ allow lifting this restriction.
Alternates
~~~~~~~~~~
-For the same reason, a sha256 repository cannot borrow objects from a
-sha1 repository using objects/info/alternates or
+For the same reason, a SHA-256 repository cannot borrow objects from a
+SHA-1 repository using objects/info/alternates or
$GIT_ALTERNATE_OBJECT_REPOSITORIES.
git notes
~~~~~~~~~
-The "git notes" tool annotates objects using their sha1-name as key.
+The "git notes" tool annotates objects using their SHA-1 name as key.
This design does not describe a way to migrate notes trees to use
-sha256-names. That migration is expected to happen separately (for
+SHA-256 names. That migration is expected to happen separately (for
example using a file at the root of the notes tree to describe which
hash it uses).
@@ -531,12 +557,12 @@ Until Git protocol gains SHA-256 support, using SHA-256 based storage
on public-facing Git servers is strongly discouraged. Once Git
protocol gains SHA-256 support, SHA-256 based servers are likely not
to support SHA-1 compatibility, to avoid what may be a very expensive
-hash reencode during clone and to encourage peers to modernize.
+hash re-encode during clone and to encourage peers to modernize.
The design described here allows fetches by SHA-1 clients of a
personal SHA-256 repository because it's not much more difficult than
allowing pushes from that repository. This support needs to be guarded
-by a configuration option --- servers like git.kernel.org that serve a
+by a configuration option -- servers like git.kernel.org that serve a
large number of clients would not be expected to bear that cost.
Meaning of signatures
@@ -555,7 +581,7 @@ unclear:
Git 2.12
-Does this mean Git v2.12.0 is the commit with sha1-name
+Does this mean Git v2.12.0 is the commit with SHA-1 name
e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
@@ -573,7 +599,7 @@ supports four different modes of operation:
convert any object names written to output to SHA-1, but store
objects using SHA-256. This allows users to test the code with no
visible behavior change except for performance. This allows
- allows running even tests that assume the SHA-1 hash function, to
+ running even tests that assume the SHA-1 hash function, to
sanity-check the behavior of the new mode.
2. ("early transition") Allow both SHA-1 and SHA-256 object names in
@@ -598,44 +624,12 @@ The user can also explicitly specify which format to use for a
particular revision specifier and for output, overriding the mode. For
example:
-git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}
-
-Choice of Hash
---------------
-In early 2005, around the time that Git was written, Xiaoyun Wang,
-Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1
-collisions in 2^69 operations. In August they published details.
-Luckily, no practical demonstrations of a collision in full SHA-1 were
-published until 10 years later, in 2017.
-
-Git v2.13.0 and later subsequently moved to a hardened SHA-1
-implementation by default that mitigates the SHAttered attack, but
-SHA-1 is still believed to be weak.
-
-The hash to replace this hardened SHA-1 should be stronger than SHA-1
-was: we would like it to be trustworthy and useful in practice for at
-least 10 years.
-
-Some other relevant properties:
-
-1. A 256-bit hash (long enough to match common security practice; not
- excessively long to hurt performance and disk usage).
-
-2. High quality implementations should be widely available (e.g., in
- OpenSSL and Apple CommonCrypto).
-
-3. The hash function's properties should match Git's needs (e.g. Git
- requires collision and 2nd preimage resistance and does not require
- length extension resistance).
-
-4. As a tiebreaker, the hash should be fast to compute (fortunately
- many contenders are faster than SHA-1).
-
-We choose SHA-256.
+ git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}
Transition plan
---------------
Some initial steps can be implemented independently of one another:
+
- adding a hash function API (vtable)
- teaching fsck to tolerate the gpgsig-sha256 field
- excluding gpgsig-* from the fields copied by "git commit --amend"
@@ -647,10 +641,9 @@ Some initial steps can be implemented independently of one another:
- introducing index v3
- adding support for the PSRC field and safer object pruning
-
The first user-visible change is the introduction of the objectFormat
extension (without compatObjectFormat). This requires:
-- implementing the loose-object-idx
+
- teaching fsck about this mode of operation
- using the hash function API (vtable) when computing object names
- signing objects and verifying signatures
@@ -658,6 +651,8 @@ extension (without compatObjectFormat). This requires:
repository
Next comes introduction of compatObjectFormat:
+
+- implementing the loose-object-idx
- translating object names between object formats
- translating object content between object formats
- generating and verifying signatures in the compat format
@@ -669,10 +664,11 @@ Next comes introduction of compatObjectFormat:
"Object names on the command line" above)
The next step is supporting fetches and pushes to SHA-1 repositories:
+
- allow pushes to a repository using the compat format
- generate a topologically sorted list of the SHA-1 names of fetched
objects
-- convert the fetched packfile to sha256 format and generate an idx
+- convert the fetched packfile to SHA-256 format and generate an idx
file
- re-sort to match the order of objects in the fetched packfile
@@ -730,10 +726,11 @@ adoption.
Using hash functions in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
+(e.g. https://lore.kernel.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
Objects newly created would be addressed by the new hash, but inside
such an object (e.g. commit) it is still possible to address objects
using the old hash function.
+
* You cannot trust its history (needed for bisectability) in the
future without further work
* Maintenance burden as the number of supported hash functions grows
@@ -743,36 +740,38 @@ using the old hash function.
Signed objects with multiple hashes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Instead of introducing the gpgsig-sha256 field in commit and tag objects
-for sha256-content based signatures, an earlier version of this design
-added "hash sha256 <sha256-name>" fields to strengthen the existing
-sha1-content based signatures.
+for SHA-256 content based signatures, an earlier version of this design
+added "hash sha256 <SHA-256 name>" fields to strengthen the existing
+SHA-1 content based signatures.
In other words, a single signature was used to attest to the object
content using both hash functions. This had some advantages:
+
* Using one signature instead of two speeds up the signing process.
* Having one signed payload with both hashes allows the signer to
- attest to the sha1-name and sha256-name referring to the same object.
+ attest to the SHA-1 name and SHA-256 name referring to the same object.
* All users consume the same signature. Broken signatures are likely
to be detected quickly using current versions of git.
However, it also came with disadvantages:
-* Verifying a signed object requires access to the sha1-names of all
+
+* Verifying a signed object requires access to the SHA-1 names of all
objects it references, even after the transition is complete and
translation table is no longer needed for anything else. To support
- this, the design added fields such as "hash sha1 tree <sha1-name>"
- and "hash sha1 parent <sha1-name>" to the sha256-content of a signed
+ this, the design added fields such as "hash sha1 tree <SHA-1 name>"
+ and "hash sha1 parent <SHA-1 name>" to the SHA-256 content of a signed
commit, complicating the conversion process.
-* Allowing signed objects without a sha1 (for after the transition is
+* Allowing signed objects without a SHA-1 (for after the transition is
complete) complicated the design further, requiring a "nohash sha1"
- field to suppress including "hash sha1" fields in the sha256-content
+ field to suppress including "hash sha1" fields in the SHA-256 content
and signed payload.
Lazily populated translation table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some of the work of building the translation table could be deferred to
push time, but that would significantly complicate and slow down pushes.
-Calculating the sha1-name at object creation time at the same time it is
-being streamed to disk and having its sha256-name calculated should be
+Calculating the SHA-1 name at object creation time at the same time it is
+being streamed to disk and having its SHA-256 name calculated should be
an acceptable cost.
Document History
@@ -782,18 +781,19 @@ Document History
bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com,
sbeller@google.com
-Initial version sent to
-http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com
+* Initial version sent to https://lore.kernel.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com
2017-03-03 jrnieder@gmail.com
Incorporated suggestions from jonathantanmy and sbeller:
-* describe purpose of signed objects with each hash type
-* redefine signed object verification using object content under the
+
+* Describe purpose of signed objects with each hash type
+* Redefine signed object verification using object content under the
first hash function
2017-03-06 jrnieder@gmail.com
+
* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2]
-* Make sha3-based signatures a separate field, avoiding the need for
+* Make SHA3-based signatures a separate field, avoiding the need for
"hash" and "nohash" fields (thanks to peff[3]).
* Add a sorting phase to fetch (thanks to Junio for noticing the need
for this).
@@ -805,23 +805,26 @@ Incorporated suggestions from jonathantanmy and sbeller:
especially Junio).
2017-09-27 jrnieder@gmail.com, sbeller@google.com
-* use placeholder NewHash instead of SHA3-256
-* describe criteria for picking a hash function.
-* include a transition plan (thanks especially to Brandon Williams
+
+* Use placeholder NewHash instead of SHA3-256
+* Describe criteria for picking a hash function.
+* Include a transition plan (thanks especially to Brandon Williams
for fleshing these ideas out)
-* define the translation table (thanks, Shawn Pearce[5], Jonathan
+* Define the translation table (thanks, Shawn Pearce[5], Jonathan
Tan, and Masaya Suzuki)
-* avoid loose object overhead by packing more aggressively in
+* Avoid loose object overhead by packing more aggressively in
"git gc --auto"
Later history:
- See the history of this file in git.git for the history of subsequent
- edits. This document history is no longer being maintained as it
- would now be superfluous to the commit log
+* See the history of this file in git.git for the history of subsequent
+ edits. This document history is no longer being maintained as it
+ would now be superfluous to the commit log
+
+References:
-[1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
-[2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
-[3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
-[4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net
-[5] https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/
+ [1] https://lore.kernel.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
+ [2] https://lore.kernel.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
+ [3] https://lore.kernel.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
+ [4] https://lore.kernel.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net
+ [5] https://lore.kernel.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/
diff --git a/Documentation/technical/http-protocol.txt b/Documentation/technical/http-protocol.txt
deleted file mode 100644
index 9c5b6f0..0000000
--- a/Documentation/technical/http-protocol.txt
+++ /dev/null
@@ -1,518 +0,0 @@
-HTTP transfer protocols
-=======================
-
-Git supports two HTTP based transfer protocols. A "dumb" protocol
-which requires only a standard HTTP server on the server end of the
-connection, and a "smart" protocol which requires a Git aware CGI
-(or server module). This document describes both protocols.
-
-As a design feature smart clients can automatically upgrade "dumb"
-protocol URLs to smart URLs. This permits all users to have the
-same published URL, and the peers automatically select the most
-efficient transport available to them.
-
-
-URL Format
-----------
-
-URLs for Git repositories accessed by HTTP use the standard HTTP
-URL syntax documented by RFC 1738, so they are of the form:
-
- http://<host>:<port>/<path>?<searchpart>
-
-Within this documentation the placeholder `$GIT_URL` will stand for
-the http:// repository URL entered by the end-user.
-
-Servers SHOULD handle all requests to locations matching `$GIT_URL`, as
-both the "smart" and "dumb" HTTP protocols used by Git operate
-by appending additional path components onto the end of the user
-supplied `$GIT_URL` string.
-
-An example of a dumb client requesting for a loose object:
-
- $GIT_URL: http://example.com:8080/git/repo.git
- URL request: http://example.com:8080/git/repo.git/objects/d0/49f6c27a2244e12041955e262a404c7faba355
-
-An example of a smart request to a catch-all gateway:
-
- $GIT_URL: http://example.com/daemon.cgi?svc=git&q=
- URL request: http://example.com/daemon.cgi?svc=git&q=/info/refs&service=git-receive-pack
-
-An example of a request to a submodule:
-
- $GIT_URL: http://example.com/git/repo.git/path/submodule.git
- URL request: http://example.com/git/repo.git/path/submodule.git/info/refs
-
-Clients MUST strip a trailing `/`, if present, from the user supplied
-`$GIT_URL` string to prevent empty path tokens (`//`) from appearing
-in any URL sent to a server. Compatible clients MUST expand
-`$GIT_URL/info/refs` as `foo/info/refs` and not `foo//info/refs`.
-
-
-Authentication
---------------
-
-Standard HTTP authentication is used if authentication is required
-to access a repository, and MAY be configured and enforced by the
-HTTP server software.
-
-Because Git repositories are accessed by standard path components
-server administrators MAY use directory based permissions within
-their HTTP server to control repository access.
-
-Clients SHOULD support Basic authentication as described by RFC 2617.
-Servers SHOULD support Basic authentication by relying upon the
-HTTP server placed in front of the Git server software.
-
-Servers SHOULD NOT require HTTP cookies for the purposes of
-authentication or access control.
-
-Clients and servers MAY support other common forms of HTTP based
-authentication, such as Digest authentication.
-
-
-SSL
----
-
-Clients and servers SHOULD support SSL, particularly to protect
-passwords when relying on Basic HTTP authentication.
-
-
-Session State
--------------
-
-The Git over HTTP protocol (much like HTTP itself) is stateless
-from the perspective of the HTTP server side. All state MUST be
-retained and managed by the client process. This permits simple
-round-robin load-balancing on the server side, without needing to
-worry about state management.
-
-Clients MUST NOT require state management on the server side in
-order to function correctly.
-
-Servers MUST NOT require HTTP cookies in order to function correctly.
-Clients MAY store and forward HTTP cookies during request processing
-as described by RFC 2616 (HTTP/1.1). Servers SHOULD ignore any
-cookies sent by a client.
-
-
-General Request Processing
---------------------------
-
-Except where noted, all standard HTTP behavior SHOULD be assumed
-by both client and server. This includes (but is not necessarily
-limited to):
-
-If there is no repository at `$GIT_URL`, or the resource pointed to by a
-location matching `$GIT_URL` does not exist, the server MUST NOT respond
-with `200 OK` response. A server SHOULD respond with
-`404 Not Found`, `410 Gone`, or any other suitable HTTP status code
-which does not imply the resource exists as requested.
-
-If there is a repository at `$GIT_URL`, but access is not currently
-permitted, the server MUST respond with the `403 Forbidden` HTTP
-status code.
-
-Servers SHOULD support both HTTP 1.0 and HTTP 1.1.
-Servers SHOULD support chunked encoding for both request and response
-bodies.
-
-Clients SHOULD support both HTTP 1.0 and HTTP 1.1.
-Clients SHOULD support chunked encoding for both request and response
-bodies.
-
-Servers MAY return ETag and/or Last-Modified headers.
-
-Clients MAY revalidate cached entities by including If-Modified-Since
-and/or If-None-Match request headers.
-
-Servers MAY return `304 Not Modified` if the relevant headers appear
-in the request and the entity has not changed. Clients MUST treat
-`304 Not Modified` identical to `200 OK` by reusing the cached entity.
-
-Clients MAY reuse a cached entity without revalidation if the
-Cache-Control and/or Expires header permits caching. Clients and
-servers MUST follow RFC 2616 for cache controls.
-
-
-Discovering References
-----------------------
-
-All HTTP clients MUST begin either a fetch or a push exchange by
-discovering the references available on the remote repository.
-
-Dumb Clients
-~~~~~~~~~~~~
-
-HTTP clients that only support the "dumb" protocol MUST discover
-references by making a request for the special info/refs file of
-the repository.
-
-Dumb HTTP clients MUST make a `GET` request to `$GIT_URL/info/refs`,
-without any search/query parameters.
-
- C: GET $GIT_URL/info/refs HTTP/1.0
-
- S: 200 OK
- S:
- S: 95dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint
- S: d049f6c27a2244e12041955e262a404c7faba355 refs/heads/master
- S: 2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0
- S: a3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}
-
-The Content-Type of the returned info/refs entity SHOULD be
-`text/plain; charset=utf-8`, but MAY be any content type.
-Clients MUST NOT attempt to validate the returned Content-Type.
-Dumb servers MUST NOT return a return type starting with
-`application/x-git-`.
-
-Cache-Control headers MAY be returned to disable caching of the
-returned entity.
-
-When examining the response clients SHOULD only examine the HTTP
-status code. Valid responses are `200 OK`, or `304 Not Modified`.
-
-The returned content is a UNIX formatted text file describing
-each ref and its known value. The file SHOULD be sorted by name
-according to the C locale ordering. The file SHOULD NOT include
-the default ref named `HEAD`.
-
- info_refs = *( ref_record )
- ref_record = any_ref / peeled_ref
-
- any_ref = obj-id HTAB refname LF
- peeled_ref = obj-id HTAB refname LF
- obj-id HTAB refname "^{}" LF
-
-Smart Clients
-~~~~~~~~~~~~~
-
-HTTP clients that support the "smart" protocol (or both the
-"smart" and "dumb" protocols) MUST discover references by making
-a parameterized request for the info/refs file of the repository.
-
-The request MUST contain exactly one query parameter,
-`service=$servicename`, where `$servicename` MUST be the service
-name the client wishes to contact to complete the operation.
-The request MUST NOT contain additional query parameters.
-
- C: GET $GIT_URL/info/refs?service=git-upload-pack HTTP/1.0
-
-dumb server reply:
-
- S: 200 OK
- S:
- S: 95dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint
- S: d049f6c27a2244e12041955e262a404c7faba355 refs/heads/master
- S: 2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0
- S: a3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}
-
-smart server reply:
-
- S: 200 OK
- S: Content-Type: application/x-git-upload-pack-advertisement
- S: Cache-Control: no-cache
- S:
- S: 001e# service=git-upload-pack\n
- S: 0000
- S: 004895dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint\0multi_ack\n
- S: 0042d049f6c27a2244e12041955e262a404c7faba355 refs/heads/master\n
- S: 003c2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0\n
- S: 003fa3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}\n
- S: 0000
-
-The client may send Extra Parameters (see
-Documentation/technical/pack-protocol.txt) as a colon-separated string
-in the Git-Protocol HTTP header.
-
-Dumb Server Response
-^^^^^^^^^^^^^^^^^^^^
-Dumb servers MUST respond with the dumb server reply format.
-
-See the prior section under dumb clients for a more detailed
-description of the dumb server response.
-
-Smart Server Response
-^^^^^^^^^^^^^^^^^^^^^
-If the server does not recognize the requested service name, or the
-requested service name has been disabled by the server administrator,
-the server MUST respond with the `403 Forbidden` HTTP status code.
-
-Otherwise, smart servers MUST respond with the smart server reply
-format for the requested service name.
-
-Cache-Control headers SHOULD be used to disable caching of the
-returned entity.
-
-The Content-Type MUST be `application/x-$servicename-advertisement`.
-Clients SHOULD fall back to the dumb protocol if another content
-type is returned. When falling back to the dumb protocol clients
-SHOULD NOT make an additional request to `$GIT_URL/info/refs`, but
-instead SHOULD use the response already in hand. Clients MUST NOT
-continue if they do not support the dumb protocol.
-
-Clients MUST validate the status code is either `200 OK` or
-`304 Not Modified`.
-
-Clients MUST validate the first five bytes of the response entity
-matches the regex `^[0-9a-f]{4}#`. If this test fails, clients
-MUST NOT continue.
-
-Clients MUST parse the entire response as a sequence of pkt-line
-records.
-
-Clients MUST verify the first pkt-line is `# service=$servicename`.
-Servers MUST set $servicename to be the request parameter value.
-Servers SHOULD include an LF at the end of this line.
-Clients MUST ignore an LF at the end of the line.
-
-Servers MUST terminate the response with the magic `0000` end
-pkt-line marker.
-
-The returned response is a pkt-line stream describing each ref and
-its known value. The stream SHOULD be sorted by name according to
-the C locale ordering. The stream SHOULD include the default ref
-named `HEAD` as the first ref. The stream MUST include capability
-declarations behind a NUL on the first ref.
-
-The returned response contains "version 1" if "version=1" was sent as an
-Extra Parameter.
-
- smart_reply = PKT-LINE("# service=$servicename" LF)
- "0000"
- *1("version 1")
- ref_list
- "0000"
- ref_list = empty_list / non_empty_list
-
- empty_list = PKT-LINE(zero-id SP "capabilities^{}" NUL cap-list LF)
-
- non_empty_list = PKT-LINE(obj-id SP name NUL cap_list LF)
- *ref_record
-
- cap-list = capability *(SP capability)
- capability = 1*(LC_ALPHA / DIGIT / "-" / "_")
- LC_ALPHA = %x61-7A
-
- ref_record = any_ref / peeled_ref
- any_ref = PKT-LINE(obj-id SP name LF)
- peeled_ref = PKT-LINE(obj-id SP name LF)
- PKT-LINE(obj-id SP name "^{}" LF
-
-
-Smart Service git-upload-pack
-------------------------------
-This service reads from the repository pointed to by `$GIT_URL`.
-
-Clients MUST first perform ref discovery with
-`$GIT_URL/info/refs?service=git-upload-pack`.
-
- C: POST $GIT_URL/git-upload-pack HTTP/1.0
- C: Content-Type: application/x-git-upload-pack-request
- C:
- C: 0032want 0a53e9ddeaddad63ad106860237bbf53411d11a7\n
- C: 0032have 441b40d833fdfa93eb2908e52742248faf0ee993\n
- C: 0000
-
- S: 200 OK
- S: Content-Type: application/x-git-upload-pack-result
- S: Cache-Control: no-cache
- S:
- S: ....ACK %s, continue
- S: ....NAK
-
-Clients MUST NOT reuse or revalidate a cached response.
-Servers MUST include sufficient Cache-Control headers
-to prevent caching of the response.
-
-Servers SHOULD support all capabilities defined here.
-
-Clients MUST send at least one "want" command in the request body.
-Clients MUST NOT reference an id in a "want" command which did not
-appear in the response obtained through ref discovery unless the
-server advertises capability `allow-tip-sha1-in-want` or
-`allow-reachable-sha1-in-want`.
-
- compute_request = want_list
- have_list
- request_end
- request_end = "0000" / "done"
-
- want_list = PKT-LINE(want SP cap_list LF)
- *(want_pkt)
- want_pkt = PKT-LINE(want LF)
- want = "want" SP id
- cap_list = capability *(SP capability)
-
- have_list = *PKT-LINE("have" SP id LF)
-
-TODO: Document this further.
-
-The Negotiation Algorithm
-~~~~~~~~~~~~~~~~~~~~~~~~~
-The computation to select the minimal pack proceeds as follows
-(C = client, S = server):
-
-'init step:'
-
-C: Use ref discovery to obtain the advertised refs.
-
-C: Place any object seen into set `advertised`.
-
-C: Build an empty set, `common`, to hold the objects that are later
- determined to be on both ends.
-
-C: Build a set, `want`, of the objects from `advertised` the client
- wants to fetch, based on what it saw during ref discovery.
-
-C: Start a queue, `c_pending`, ordered by commit time (popping newest
- first). Add all client refs. When a commit is popped from
- the queue its parents SHOULD be automatically inserted back.
- Commits MUST only enter the queue once.
-
-'one compute step:'
-
-C: Send one `$GIT_URL/git-upload-pack` request:
-
- C: 0032want <want #1>...............................
- C: 0032want <want #2>...............................
- ....
- C: 0032have <common #1>.............................
- C: 0032have <common #2>.............................
- ....
- C: 0032have <have #1>...............................
- C: 0032have <have #2>...............................
- ....
- C: 0000
-
-The stream is organized into "commands", with each command
-appearing by itself in a pkt-line. Within a command line,
-the text leading up to the first space is the command name,
-and the remainder of the line to the first LF is the value.
-Command lines are terminated with an LF as the last byte of
-the pkt-line value.
-
-Commands MUST appear in the following order, if they appear
-at all in the request stream:
-
-* "want"
-* "have"
-
-The stream is terminated by a pkt-line flush (`0000`).
-
-A single "want" or "have" command MUST have one hex formatted
-SHA-1 as its value. Multiple SHA-1s MUST be sent by sending
-multiple commands.
-
-The `have` list is created by popping the first 32 commits
-from `c_pending`. Less can be supplied if `c_pending` empties.
-
-If the client has sent 256 "have" commits and has not yet
-received one of those back from `s_common`, or the client has
-emptied `c_pending` it SHOULD include a "done" command to let
-the server know it won't proceed:
-
- C: 0009done
-
-S: Parse the git-upload-pack request:
-
-Verify all objects in `want` are directly reachable from refs.
-
-The server MAY walk backwards through history or through
-the reflog to permit slightly stale requests.
-
-If no "want" objects are received, send an error:
-TODO: Define error if no "want" lines are requested.
-
-If any "want" object is not reachable, send an error:
-TODO: Define error if an invalid "want" is requested.
-
-Create an empty list, `s_common`.
-
-If "have" was sent:
-
-Loop through the objects in the order supplied by the client.
-
-For each object, if the server has the object reachable from
-a ref, add it to `s_common`. If a commit is added to `s_common`,
-do not add any ancestors, even if they also appear in `have`.
-
-S: Send the git-upload-pack response:
-
-If the server has found a closed set of objects to pack or the
-request ends with "done", it replies with the pack.
-TODO: Document the pack based response
-
- S: PACK...
-
-The returned stream is the side-band-64k protocol supported
-by the git-upload-pack service, and the pack is embedded into
-stream 1. Progress messages from the server side MAY appear
-in stream 2.
-
-Here a "closed set of objects" is defined to have at least
-one path from every "want" to at least one "common" object.
-
-If the server needs more information, it replies with a
-status continue response:
-TODO: Document the non-pack response
-
-C: Parse the upload-pack response:
- TODO: Document parsing response
-
-'Do another compute step.'
-
-
-Smart Service git-receive-pack
-------------------------------
-This service reads from the repository pointed to by `$GIT_URL`.
-
-Clients MUST first perform ref discovery with
-`$GIT_URL/info/refs?service=git-receive-pack`.
-
- C: POST $GIT_URL/git-receive-pack HTTP/1.0
- C: Content-Type: application/x-git-receive-pack-request
- C:
- C: ....0a53e9ddeaddad63ad106860237bbf53411d11a7 441b40d833fdfa93eb2908e52742248faf0ee993 refs/heads/maint\0 report-status
- C: 0000
- C: PACK....
-
- S: 200 OK
- S: Content-Type: application/x-git-receive-pack-result
- S: Cache-Control: no-cache
- S:
- S: ....
-
-Clients MUST NOT reuse or revalidate a cached response.
-Servers MUST include sufficient Cache-Control headers
-to prevent caching of the response.
-
-Servers SHOULD support all capabilities defined here.
-
-Clients MUST send at least one command in the request body.
-Within the command portion of the request body clients SHOULD send
-the id obtained through ref discovery as old_id.
-
- update_request = command_list
- "PACK" <binary data>
-
- command_list = PKT-LINE(command NUL cap_list LF)
- *(command_pkt)
- command_pkt = PKT-LINE(command LF)
- cap_list = *(SP capability) SP
-
- command = create / delete / update
- create = zero-id SP new_id SP name
- delete = old_id SP zero-id SP name
- update = old_id SP new_id SP name
-
-TODO: Document this further.
-
-
-References
-----------
-
-http://www.ietf.org/rfc/rfc1738.txt[RFC 1738: Uniform Resource Locators (URL)]
-http://www.ietf.org/rfc/rfc2616.txt[RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1]
-link:technical/pack-protocol.html
-link:technical/protocol-capabilities.html
diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
deleted file mode 100644
index 7c4d67a..0000000
--- a/Documentation/technical/index-format.txt
+++ /dev/null
@@ -1,357 +0,0 @@
-Git index format
-================
-
-== The Git index file has the following format
-
- All binary numbers are in network byte order. Version 2 is described
- here unless stated otherwise.
-
- - A 12-byte header consisting of
-
- 4-byte signature:
- The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
-
- 4-byte version number:
- The current supported versions are 2, 3 and 4.
-
- 32-bit number of index entries.
-
- - A number of sorted index entries (see below).
-
- - Extensions
-
- Extensions are identified by signature. Optional extensions can
- be ignored if Git does not understand them.
-
- Git currently supports cached tree and resolve undo extensions.
-
- 4-byte extension signature. If the first byte is 'A'..'Z' the
- extension is optional and can be ignored.
-
- 32-bit size of the extension
-
- Extension data
-
- - 160-bit SHA-1 over the content of the index file before this
- checksum.
-
-== Index entry
-
- Index entries are sorted in ascending order on the name field,
- interpreted as a string of unsigned bytes (i.e. memcmp() order, no
- localization, no special casing of directory separator '/'). Entries
- with the same name are sorted by their stage field.
-
- 32-bit ctime seconds, the last time a file's metadata changed
- this is stat(2) data
-
- 32-bit ctime nanosecond fractions
- this is stat(2) data
-
- 32-bit mtime seconds, the last time a file's data changed
- this is stat(2) data
-
- 32-bit mtime nanosecond fractions
- this is stat(2) data
-
- 32-bit dev
- this is stat(2) data
-
- 32-bit ino
- this is stat(2) data
-
- 32-bit mode, split into (high to low bits)
-
- 4-bit object type
- valid values in binary are 1000 (regular file), 1010 (symbolic link)
- and 1110 (gitlink)
-
- 3-bit unused
-
- 9-bit unix permission. Only 0755 and 0644 are valid for regular files.
- Symbolic links and gitlinks have value 0 in this field.
-
- 32-bit uid
- this is stat(2) data
-
- 32-bit gid
- this is stat(2) data
-
- 32-bit file size
- This is the on-disk size from stat(2), truncated to 32-bit.
-
- 160-bit SHA-1 for the represented object
-
- A 16-bit 'flags' field split into (high to low bits)
-
- 1-bit assume-valid flag
-
- 1-bit extended flag (must be zero in version 2)
-
- 2-bit stage (during merge)
-
- 12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
- is stored in this field.
-
- (Version 3 or later) A 16-bit field, only applicable if the
- "extended flag" above is 1, split into (high to low bits).
-
- 1-bit reserved for future
-
- 1-bit skip-worktree flag (used by sparse checkout)
-
- 1-bit intent-to-add flag (used by "git add -N")
-
- 13-bit unused, must be zero
-
- Entry path name (variable length) relative to top level directory
- (without leading slash). '/' is used as path separator. The special
- path components ".", ".." and ".git" (without quotes) are disallowed.
- Trailing slash is also disallowed.
-
- The exact encoding is undefined, but the '.' and '/' characters
- are encoded in 7-bit ASCII and the encoding cannot contain a NUL
- byte (iow, this is a UNIX pathname).
-
- (Version 4) In version 4, the entry path name is prefix-compressed
- relative to the path name for the previous entry (the very first
- entry is encoded as if the path name for the previous entry is an
- empty string). At the beginning of an entry, an integer N in the
- variable width encoding (the same encoding as the offset is encoded
- for OFS_DELTA pack entries; see pack-format.txt) is stored, followed
- by a NUL-terminated string S. Removing N bytes from the end of the
- path name for the previous entry, and replacing it with the string S
- yields the path name for this entry.
-
- 1-8 nul bytes as necessary to pad the entry to a multiple of eight bytes
- while keeping the name NUL-terminated.
-
- (Version 4) In version 4, the padding after the pathname does not
- exist.
-
- Interpretation of index entries in split index mode is completely
- different. See below for details.
-
-== Extensions
-
-=== Cached tree
-
- Cached tree extension contains pre-computed hashes for trees that can
- be derived from the index. It helps speed up tree object generation
- from index for a new commit.
-
- When a path is updated in index, the path must be invalidated and
- removed from tree cache.
-
- The signature for this extension is { 'T', 'R', 'E', 'E' }.
-
- A series of entries fill the entire extension; each of which
- consists of:
-
- - NUL-terminated path component (relative to its parent directory);
-
- - ASCII decimal number of entries in the index that is covered by the
- tree this entry represents (entry_count);
-
- - A space (ASCII 32);
-
- - ASCII decimal number that represents the number of subtrees this
- tree has;
-
- - A newline (ASCII 10); and
-
- - 160-bit object name for the object that would result from writing
- this span of index as a tree.
-
- An entry can be in an invalidated state and is represented by having
- a negative number in the entry_count field. In this case, there is no
- object name and the next entry starts immediately after the newline.
- When writing an invalid entry, -1 should always be used as entry_count.
-
- The entries are written out in the top-down, depth-first order. The
- first entry represents the root level of the repository, followed by the
- first subtree--let's call this A--of the root level (with its name
- relative to the root level), followed by the first subtree of A (with
- its name relative to A), ...
-
-=== Resolve undo
-
- A conflict is represented in the index as a set of higher stage entries.
- When a conflict is resolved (e.g. with "git add path"), these higher
- stage entries will be removed and a stage-0 entry with proper resolution
- is added.
-
- When these higher stage entries are removed, they are saved in the
- resolve undo extension, so that conflicts can be recreated (e.g. with
- "git checkout -m"), in case users want to redo a conflict resolution
- from scratch.
-
- The signature for this extension is { 'R', 'E', 'U', 'C' }.
-
- A series of entries fill the entire extension; each of which
- consists of:
-
- - NUL-terminated pathname the entry describes (relative to the root of
- the repository, i.e. full pathname);
-
- - Three NUL-terminated ASCII octal numbers, entry mode of entries in
- stage 1 to 3 (a missing stage is represented by "0" in this field);
- and
-
- - At most three 160-bit object names of the entry in stages from 1 to 3
- (nothing is written for a missing stage).
-
-=== Split index
-
- In split index mode, the majority of index entries could be stored
- in a separate file. This extension records the changes to be made on
- top of that to produce the final index.
-
- The signature for this extension is { 'l', 'i', 'n', 'k' }.
-
- The extension consists of:
-
- - 160-bit SHA-1 of the shared index file. The shared index file path
- is $GIT_DIR/sharedindex.<SHA-1>. If all 160 bits are zero, the
- index does not require a shared index file.
-
- - An ewah-encoded delete bitmap, each bit represents an entry in the
- shared index. If a bit is set, its corresponding entry in the
- shared index will be removed from the final index. Note, because
- a delete operation changes index entry positions, but we do need
- original positions in replace phase, it's best to just mark
- entries for removal, then do a mass deletion after replacement.
-
- - An ewah-encoded replace bitmap, each bit represents an entry in
- the shared index. If a bit is set, its corresponding entry in the
- shared index will be replaced with an entry in this index
- file. All replaced entries are stored in sorted order in this
- index. The first "1" bit in the replace bitmap corresponds to the
- first index entry, the second "1" bit to the second entry and so
- on. Replaced entries may have empty path names to save space.
-
- The remaining index entries after replaced ones will be added to the
- final index. These added entries are also sorted by entry name then
- stage.
-
-== Untracked cache
-
- Untracked cache saves the untracked file list and necessary data to
- verify the cache. The signature for this extension is { 'U', 'N',
- 'T', 'R' }.
-
- The extension starts with
-
- - A sequence of NUL-terminated strings, preceded by the size of the
- sequence in variable width encoding. Each string describes the
- environment where the cache can be used.
-
- - Stat data of $GIT_DIR/info/exclude. See "Index entry" section from
- ctime field until "file size".
-
- - Stat data of core.excludesfile
-
- - 32-bit dir_flags (see struct dir_struct)
-
- - 160-bit SHA-1 of $GIT_DIR/info/exclude. Null SHA-1 means the file
- does not exist.
-
- - 160-bit SHA-1 of core.excludesfile. Null SHA-1 means the file does
- not exist.
-
- - NUL-terminated string of per-dir exclude file name. This usually
- is ".gitignore".
-
- - The number of following directory blocks, variable width
- encoding. If this number is zero, the extension ends here with a
- following NUL.
-
- - A number of directory blocks in depth-first-search order, each
- consists of
-
- - The number of untracked entries, variable width encoding.
-
- - The number of sub-directory blocks, variable width encoding.
-
- - The directory name terminated by NUL.
-
- - A number of untracked file/dir names terminated by NUL.
-
-The remaining data of each directory block is grouped by type:
-
- - An ewah bitmap, the n-th bit marks whether the n-th directory has
- valid untracked cache entries.
-
- - An ewah bitmap, the n-th bit records "check-only" bit of
- read_directory_recursive() for the n-th directory.
-
- - An ewah bitmap, the n-th bit indicates whether SHA-1 and stat data
- is valid for the n-th directory and exists in the next data.
-
- - An array of stat data. The n-th data corresponds with the n-th
- "one" bit in the previous ewah bitmap.
-
- - An array of SHA-1. The n-th SHA-1 corresponds with the n-th "one" bit
- in the previous ewah bitmap.
-
- - One NUL.
-
-== File System Monitor cache
-
- The file system monitor cache tracks files for which the core.fsmonitor
- hook has told us about changes. The signature for this extension is
- { 'F', 'S', 'M', 'N' }.
-
- The extension starts with
-
- - 32-bit version number: the current supported version is 1.
-
- - 64-bit time: the extension data reflects all changes through the given
- time which is stored as the nanoseconds elapsed since midnight,
- January 1, 1970.
-
- - 32-bit bitmap size: the size of the CE_FSMONITOR_VALID bitmap.
-
- - An ewah bitmap, the n-th bit indicates whether the n-th index entry
- is not CE_FSMONITOR_VALID.
-
-== End of Index Entry
-
- The End of Index Entry (EOIE) is used to locate the end of the variable
- length index entries and the begining of the extensions. Code can take
- advantage of this to quickly locate the index extensions without having
- to parse through all of the index entries.
-
- Because it must be able to be loaded before the variable length cache
- entries and other index extensions, this extension must be written last.
- The signature for this extension is { 'E', 'O', 'I', 'E' }.
-
- The extension consists of:
-
- - 32-bit offset to the end of the index entries
-
- - 160-bit SHA-1 over the extension types and their sizes (but not
- their contents). E.g. if we have "TREE" extension that is N-bytes
- long, "REUC" extension that is M-bytes long, followed by "EOIE",
- then the hash would be:
-
- SHA-1("TREE" + <binary representation of N> +
- "REUC" + <binary representation of M>)
-
-== Index Entry Offset Table
-
- The Index Entry Offset Table (IEOT) is used to help address the CPU
- cost of loading the index by enabling multi-threading the process of
- converting cache entries from the on-disk format to the in-memory format.
- The signature for this extension is { 'I', 'E', 'O', 'T' }.
-
- The extension consists of:
-
- - 32-bit version (currently 1)
-
- - A number of index offset entries each consisting of:
-
- - 32-bit offset from the begining of the file to the first cache entry
- in this block of entries.
-
- - 32-bit count of cache entries in this block
diff --git a/Documentation/technical/long-running-process-protocol.txt b/Documentation/technical/long-running-process-protocol.txt
index aa0aa9a..6f33654 100644
--- a/Documentation/technical/long-running-process-protocol.txt
+++ b/Documentation/technical/long-running-process-protocol.txt
@@ -3,7 +3,7 @@ Long-running process protocol
This protocol is used when Git needs to communicate with an external
process throughout the entire life of a single Git command. All
-communication is in pkt-line format (see technical/protocol-common.txt)
+communication is in pkt-line format (see linkgit:gitprotocol-common[5])
over standard input and standard output.
Handshake
diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
index d7e5763..f2221d2 100644
--- a/Documentation/technical/multi-pack-index.txt
+++ b/Documentation/technical/multi-pack-index.txt
@@ -17,13 +17,14 @@ is not feasible due to storage space or excessive repack times.
The multi-pack-index (MIDX for short) stores a list of objects
and their offsets into multiple packfiles. It contains:
-- A list of packfile names.
-- A sorted list of object IDs.
-- A list of metadata for the ith object ID including:
- - A value j referring to the jth packfile.
- - An offset within the jth packfile for the object.
-- If large offsets are required, we use another list of large
+* A list of packfile names.
+* A sorted list of object IDs.
+* A list of metadata for the ith object ID including:
+** A value j referring to the jth packfile.
+** An offset within the jth packfile for the object.
+* If large offsets are required, we use another list of large
offsets similar to version 2 pack-indexes.
+- An optional list of objects in pseudo-pack order (used with MIDX bitmaps).
Thus, we can provide O(log N) lookup time for any number
of packfiles.
@@ -36,15 +37,18 @@ Design Details
directory of an alternate. It refers only to packfiles in that
same directory.
-- The pack.multiIndex config setting must be on to consume MIDX files.
+- The core.multiPackIndex config setting must be on (which is the
+ default) to consume MIDX files. Setting it to `false` prevents
+ Git from reading a MIDX file, even if one exists.
- The file format includes parameters for the object ID hash
function, so a future change of hash algorithm does not require
a change in format.
- The MIDX keeps only one record per object ID. If an object appears
- in multiple packfiles, then the MIDX selects the copy in the most-
- recently modified packfile.
+ in multiple packfiles, then the MIDX selects the copy in the
+ preferred packfile, otherwise selecting from the most-recently
+ modified packfile.
- If there exist packfiles in the pack directory not registered in
the MIDX, then those packfiles are loaded into the `packed_git`
@@ -60,10 +64,6 @@ Design Details
Future Work
-----------
-- Add a 'verify' subcommand to the 'git midx' builtin to verify the
- contents of the multi-pack-index file match the offsets listed in
- the corresponding pack-indexes.
-
- The multi-pack-index allows many packfiles, especially in a context
where repacking is expensive (such as a very large repo), or
unexpected maintenance time is unacceptable (such as a high-demand
@@ -74,14 +74,10 @@ Future Work
still reducing the number of binary searches required for object
lookups.
-- The reachability bitmap is currently paired directly with a single
- packfile, using the pack-order as the object order to hopefully
- compress the bitmaps well using run-length encoding. This could be
- extended to pair a reachability bitmap with a multi-pack-index. If
- the multi-pack-index is extended to store a "stable object order"
+- If the multi-pack-index is extended to store a "stable object order"
(a function Order(hash) = integer that is constant for a given hash,
- even as the multi-pack-index is updated) then a reachability bitmap
- could point to a multi-pack-index and be updated independently.
+ even as the multi-pack-index is updated) then MIDX bitmaps could be
+ updated independently of the MIDX.
- Packfiles can be marked as "special" using empty files that share
the initial name but replace ".pack" with ".keep" or ".promisor".
@@ -92,18 +88,13 @@ Future Work
helpful to organize packfiles by object type (commit, tree, blob,
etc.) and use this metadata to help that maintenance.
-- The partial clone feature records special "promisor" packs that
- may point to objects that are not stored locally, but available
- on request to a server. The multi-pack-index does not currently
- track these promisor packs.
-
Related Links
-------------
[0] https://bugs.chromium.org/p/git/issues/detail?id=6
Chromium work item for: Multi-Pack Index (MIDX)
-[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
+[1] https://lore.kernel.org/git/20180107181459.222909-1-dstolee@microsoft.com/
An earlier RFC for the multi-pack-index feature
-[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
+[2] https://lore.kernel.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
deleted file mode 100644
index cab5bdd..0000000
--- a/Documentation/technical/pack-format.txt
+++ /dev/null
@@ -1,331 +0,0 @@
-Git pack format
-===============
-
-== pack-*.pack files have the following format:
-
- - A header appears at the beginning and consists of the following:
-
- 4-byte signature:
- The signature is: {'P', 'A', 'C', 'K'}
-
- 4-byte version number (network byte order):
- Git currently accepts version number 2 or 3 but
- generates version 2 only.
-
- 4-byte number of objects contained in the pack (network byte order)
-
- Observation: we cannot have more than 4G versions ;-) and
- more than 4G objects in a pack.
-
- - The header is followed by number of object entries, each of
- which looks like this:
-
- (undeltified representation)
- n-byte type and length (3-bit type, (n-1)*7+4-bit length)
- compressed data
-
- (deltified representation)
- n-byte type and length (3-bit type, (n-1)*7+4-bit length)
- 20-byte base object name if OBJ_REF_DELTA or a negative relative
- offset from the delta object's position in the pack if this
- is an OBJ_OFS_DELTA object
- compressed delta data
-
- Observation: length of each object is encoded in a variable
- length format and is not constrained to 32-bit or anything.
-
- - The trailer records 20-byte SHA-1 checksum of all of the above.
-
-=== Object types
-
-Valid object types are:
-
-- OBJ_COMMIT (1)
-- OBJ_TREE (2)
-- OBJ_BLOB (3)
-- OBJ_TAG (4)
-- OBJ_OFS_DELTA (6)
-- OBJ_REF_DELTA (7)
-
-Type 5 is reserved for future expansion. Type 0 is invalid.
-
-=== Deltified representation
-
-Conceptually there are only four object types: commit, tree, tag and
-blob. However to save space, an object could be stored as a "delta" of
-another "base" object. These representations are assigned new types
-ofs-delta and ref-delta, which is only valid in a pack file.
-
-Both ofs-delta and ref-delta store the "delta" to be applied to
-another object (called 'base object') to reconstruct the object. The
-difference between them is, ref-delta directly encodes 20-byte base
-object name. If the base object is in the same pack, ofs-delta encodes
-the offset of the base object in the pack instead.
-
-The base object could also be deltified if it's in the same pack.
-Ref-delta can also refer to an object outside the pack (i.e. the
-so-called "thin pack"). When stored on disk however, the pack should
-be self contained to avoid cyclic dependency.
-
-The delta data is a sequence of instructions to reconstruct an object
-from the base object. If the base object is deltified, it must be
-converted to canonical form first. Each instruction appends more and
-more data to the target object until it's complete. There are two
-supported instructions so far: one for copy a byte range from the
-source object and one for inserting new data embedded in the
-instruction itself.
-
-Each instruction has variable length. Instruction type is determined
-by the seventh bit of the first octet. The following diagrams follow
-the convention in RFC 1951 (Deflate compressed data format).
-
-==== Instruction to copy from base object
-
- +----------+---------+---------+---------+---------+-------+-------+-------+
- | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
- +----------+---------+---------+---------+---------+-------+-------+-------+
-
-This is the instruction format to copy a byte range from the source
-object. It encodes the offset to copy from and the number of bytes to
-copy. Offset and size are in little-endian order.
-
-All offset and size bytes are optional. This is to reduce the
-instruction size when encoding small offsets or sizes. The first seven
-bits in the first octet determines which of the next seven octets is
-present. If bit zero is set, offset1 is present. If bit one is set
-offset2 is present and so on.
-
-Note that a more compact instruction does not change offset and size
-encoding. For example, if only offset2 is omitted like below, offset3
-still contains bits 16-23. It does not become offset2 and contains
-bits 8-15 even if it's right next to offset1.
-
- +----------+---------+---------+
- | 10000101 | offset1 | offset3 |
- +----------+---------+---------+
-
-In its most compact form, this instruction only takes up one byte
-(0x80) with both offset and size omitted, which will have default
-values zero. There is another exception: size zero is automatically
-converted to 0x10000.
-
-==== Instruction to add new data
-
- +----------+============+
- | 0xxxxxxx | data |
- +----------+============+
-
-This is the instruction to construct target object without the base
-object. The following data is appended to the target object. The first
-seven bits of the first octet determines the size of data in
-bytes. The size must be non-zero.
-
-==== Reserved instruction
-
- +----------+============
- | 00000000 |
- +----------+============
-
-This is the instruction reserved for future expansion.
-
-== Original (version 1) pack-*.idx files have the following format:
-
- - The header consists of 256 4-byte network byte order
- integers. N-th entry of this table records the number of
- objects in the corresponding pack, the first byte of whose
- object name is less than or equal to N. This is called the
- 'first-level fan-out' table.
-
- - The header is followed by sorted 24-byte entries, one entry
- per object in the pack. Each entry is:
-
- 4-byte network byte order integer, recording where the
- object is stored in the packfile as the offset from the
- beginning.
-
- 20-byte object name.
-
- - The file is concluded with a trailer:
-
- A copy of the 20-byte SHA-1 checksum at the end of
- corresponding packfile.
-
- 20-byte SHA-1-checksum of all of the above.
-
-Pack Idx file:
-
- -- +--------------------------------+
-fanout | fanout[0] = 2 (for example) |-.
-table +--------------------------------+ |
- | fanout[1] | |
- +--------------------------------+ |
- | fanout[2] | |
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
- | fanout[255] = total objects |---.
- -- +--------------------------------+ | |
-main | offset | | |
-index | object name 00XXXXXXXXXXXXXXXX | | |
-table +--------------------------------+ | |
- | offset | | |
- | object name 00XXXXXXXXXXXXXXXX | | |
- +--------------------------------+<+ |
- .-| offset | |
- | | object name 01XXXXXXXXXXXXXXXX | |
- | +--------------------------------+ |
- | | offset | |
- | | object name 01XXXXXXXXXXXXXXXX | |
- | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
- | | offset | |
- | | object name FFXXXXXXXXXXXXXXXX | |
- --| +--------------------------------+<--+
-trailer | | packfile checksum |
- | +--------------------------------+
- | | idxfile checksum |
- | +--------------------------------+
- .-------.
- |
-Pack file entry: <+
-
- packed object header:
- 1-byte size extension bit (MSB)
- type (next 3 bit)
- size0 (lower 4-bit)
- n-byte sizeN (as long as MSB is set, each 7-bit)
- size0..sizeN form 4+7+7+..+7 bit integer, size0
- is the least significant part, and sizeN is the
- most significant part.
- packed object data:
- If it is not DELTA, then deflated bytes (the size above
- is the size before compression).
- If it is REF_DELTA, then
- 20-byte base object name SHA-1 (the size above is the
- size of the delta data that follows).
- delta data, deflated.
- If it is OFS_DELTA, then
- n-byte offset (see below) interpreted as a negative
- offset from the type-byte of the header of the
- ofs-delta entry (the size above is the size of
- the delta data that follows).
- delta data, deflated.
-
- offset encoding:
- n bytes with MSB set in all but the last one.
- The offset is then the number constructed by
- concatenating the lower 7 bit of each byte, and
- for n >= 2 adding 2^7 + 2^14 + ... + 2^(7*(n-1))
- to the result.
-
-
-
-== Version 2 pack-*.idx files support packs larger than 4 GiB, and
- have some other reorganizations. They have the format:
-
- - A 4-byte magic number '\377tOc' which is an unreasonable
- fanout[0] value.
-
- - A 4-byte version number (= 2)
-
- - A 256-entry fan-out table just like v1.
-
- - A table of sorted 20-byte SHA-1 object names. These are
- packed together without offset values to reduce the cache
- footprint of the binary search for a specific object name.
-
- - A table of 4-byte CRC32 values of the packed object data.
- This is new in v2 so compressed data can be copied directly
- from pack to pack during repacking without undetected
- data corruption.
-
- - A table of 4-byte offset values (in network byte order).
- These are usually 31-bit pack file offsets, but large
- offsets are encoded as an index into the next table with
- the msbit set.
-
- - A table of 8-byte offset entries (empty for pack files less
- than 2 GiB). Pack files are organized with heavily used
- objects toward the front, so most object references should
- not need to refer to this table.
-
- - The same trailer as a v1 pack file:
-
- A copy of the 20-byte SHA-1 checksum at the end of
- corresponding packfile.
-
- 20-byte SHA-1-checksum of all of the above.
-
-== multi-pack-index (MIDX) files have the following format:
-
-The multi-pack-index files refer to multiple pack-files and loose objects.
-
-In order to allow extensions that add extra data to the MIDX, we organize
-the body into "chunks" and provide a lookup table at the beginning of the
-body. The header includes certain length values, such as the number of packs,
-the number of base MIDX files, hash lengths and types.
-
-All 4-byte numbers are in network order.
-
-HEADER:
-
- 4-byte signature:
- The signature is: {'M', 'I', 'D', 'X'}
-
- 1-byte version number:
- Git only writes or recognizes version 1.
-
- 1-byte Object Id Version
- Git only writes or recognizes version 1 (SHA1).
-
- 1-byte number of "chunks"
-
- 1-byte number of base multi-pack-index files:
- This value is currently always zero.
-
- 4-byte number of pack files
-
-CHUNK LOOKUP:
-
- (C + 1) * 12 bytes providing the chunk offsets:
- First 4 bytes describe chunk id. Value 0 is a terminating label.
- Other 8 bytes provide offset in current file for chunk to start.
- (Chunks are provided in file-order, so you can infer the length
- using the next chunk position if necessary.)
-
- The remaining data in the body is described one chunk at a time, and
- these chunks may be given in any order. Chunks are required unless
- otherwise specified.
-
-CHUNK DATA:
-
- Packfile Names (ID: {'P', 'N', 'A', 'M'})
- Stores the packfile names as concatenated, null-terminated strings.
- Packfiles must be listed in lexicographic order for fast lookups by
- name. This is the only chunk not guaranteed to be a multiple of four
- bytes in length, so should be the last chunk for alignment reasons.
-
- OID Fanout (ID: {'O', 'I', 'D', 'F'})
- The ith entry, F[i], stores the number of OIDs with first
- byte at most i. Thus F[255] stores the total
- number of objects.
-
- OID Lookup (ID: {'O', 'I', 'D', 'L'})
- The OIDs for all objects in the MIDX are stored in lexicographic
- order in this chunk.
-
- Object Offsets (ID: {'O', 'O', 'F', 'F'})
- Stores two 4-byte values for every object.
- 1: The pack-int-id for the pack storing this object.
- 2: The offset within the pack.
- If all offsets are less than 2^31, then the large offset chunk
- will not exist and offsets are stored as in IDX v1.
- If there is at least one offset value larger than 2^32-1, then
- the large offset chunk must exist. If the large offset chunk
- exists and the 31st bit is on, then removing that bit reveals
- the row in the large offsets containing the 8-byte offset of
- this object.
-
- [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
- 8-byte offsets into large packfiles.
-
-TRAILER:
-
- 20-byte SHA1-checksum of the above contents.
diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt
deleted file mode 100644
index c73e72d..0000000
--- a/Documentation/technical/pack-protocol.txt
+++ /dev/null
@@ -1,674 +0,0 @@
-Packfile transfer protocols
-===========================
-
-Git supports transferring data in packfiles over the ssh://, git://, http:// and
-file:// transports. There exist two sets of protocols, one for pushing
-data from a client to a server and another for fetching data from a
-server to a client. The three transports (ssh, git, file) use the same
-protocol to transfer data. http is documented in http-protocol.txt.
-
-The processes invoked in the canonical Git implementation are 'upload-pack'
-on the server side and 'fetch-pack' on the client side for fetching data;
-then 'receive-pack' on the server and 'send-pack' on the client for pushing
-data. The protocol functions to have a server tell a client what is
-currently on the server, then for the two to negotiate the smallest amount
-of data to send in order to fully update one or the other.
-
-pkt-line Format
----------------
-
-The descriptions below build on the pkt-line format described in
-protocol-common.txt. When the grammar indicate `PKT-LINE(...)`, unless
-otherwise noted the usual pkt-line LF rules apply: the sender SHOULD
-include a LF, but the receiver MUST NOT complain if it is not present.
-
-An error packet is a special pkt-line that contains an error string.
-
-----
- error-line = PKT-LINE("ERR" SP explanation-text)
-----
-
-Throughout the protocol, where `PKT-LINE(...)` is expected, an error packet MAY
-be sent. Once this packet is sent by a client or a server, the data transfer
-process defined in this protocol is terminated.
-
-Transports
-----------
-There are three transports over which the packfile protocol is
-initiated. The Git transport is a simple, unauthenticated server that
-takes the command (almost always 'upload-pack', though Git
-servers can be configured to be globally writable, in which 'receive-
-pack' initiation is also allowed) with which the client wishes to
-communicate and executes it and connects it to the requesting
-process.
-
-In the SSH transport, the client just runs the 'upload-pack'
-or 'receive-pack' process on the server over the SSH protocol and then
-communicates with that invoked process over the SSH connection.
-
-The file:// transport runs the 'upload-pack' or 'receive-pack'
-process locally and communicates with it over a pipe.
-
-Extra Parameters
-----------------
-
-The protocol provides a mechanism in which clients can send additional
-information in its first message to the server. These are called "Extra
-Parameters", and are supported by the Git, SSH, and HTTP protocols.
-
-Each Extra Parameter takes the form of `<key>=<value>` or `<key>`.
-
-Servers that receive any such Extra Parameters MUST ignore all
-unrecognized keys. Currently, the only Extra Parameter recognized is
-"version" with a value of '1' or '2'. See protocol-v2.txt for more
-information on protocol version 2.
-
-Git Transport
--------------
-
-The Git transport starts off by sending the command and repository
-on the wire using the pkt-line format, followed by a NUL byte and a
-hostname parameter, terminated by a NUL byte.
-
- 0033git-upload-pack /project.git\0host=myserver.com\0
-
-The transport may send Extra Parameters by adding an additional NUL
-byte, and then adding one or more NUL-terminated strings:
-
- 003egit-upload-pack /project.git\0host=myserver.com\0\0version=1\0
-
---
- git-proto-request = request-command SP pathname NUL
- [ host-parameter NUL ] [ NUL extra-parameters ]
- request-command = "git-upload-pack" / "git-receive-pack" /
- "git-upload-archive" ; case sensitive
- pathname = *( %x01-ff ) ; exclude NUL
- host-parameter = "host=" hostname [ ":" port ]
- extra-parameters = 1*extra-parameter
- extra-parameter = 1*( %x01-ff ) NUL
---
-
-host-parameter is used for the
-git-daemon name based virtual hosting. See --interpolated-path
-option to git daemon, with the %H/%CH format characters.
-
-Basically what the Git client is doing to connect to an 'upload-pack'
-process on the server side over the Git protocol is this:
-
- $ echo -e -n \
- "0039git-upload-pack /schacon/gitbook.git\0host=example.com\0" |
- nc -v example.com 9418
-
-
-SSH Transport
--------------
-
-Initiating the upload-pack or receive-pack processes over SSH is
-executing the binary on the server via SSH remote execution.
-It is basically equivalent to running this:
-
- $ ssh git.example.com "git-upload-pack '/project.git'"
-
-For a server to support Git pushing and pulling for a given user over
-SSH, that user needs to be able to execute one or both of those
-commands via the SSH shell that they are provided on login. On some
-systems, that shell access is limited to only being able to run those
-two commands, or even just one of them.
-
-In an ssh:// format URI, it's absolute in the URI, so the '/' after
-the host name (or port number) is sent as an argument, which is then
-read by the remote git-upload-pack exactly as is, so it's effectively
-an absolute path in the remote filesystem.
-
- git clone ssh://user@example.com/project.git
- |
- v
- ssh user@example.com "git-upload-pack '/project.git'"
-
-In a "user@host:path" format URI, its relative to the user's home
-directory, because the Git client will run:
-
- git clone user@example.com:project.git
- |
- v
- ssh user@example.com "git-upload-pack 'project.git'"
-
-The exception is if a '~' is used, in which case
-we execute it without the leading '/'.
-
- ssh://user@example.com/~alice/project.git,
- |
- v
- ssh user@example.com "git-upload-pack '~alice/project.git'"
-
-Depending on the value of the `protocol.version` configuration variable,
-Git may attempt to send Extra Parameters as a colon-separated string in
-the GIT_PROTOCOL environment variable. This is done only if
-the `ssh.variant` configuration variable indicates that the ssh command
-supports passing environment variables as an argument.
-
-A few things to remember here:
-
-- The "command name" is spelled with dash (e.g. git-upload-pack), but
- this can be overridden by the client;
-
-- The repository path is always quoted with single quotes.
-
-Fetching Data From a Server
----------------------------
-
-When one Git repository wants to get data that a second repository
-has, the first can 'fetch' from the second. This operation determines
-what data the server has that the client does not then streams that
-data down to the client in packfile format.
-
-
-Reference Discovery
--------------------
-
-When the client initially connects the server will immediately respond
-with a version number (if "version=1" is sent as an Extra Parameter),
-and a listing of each reference it has (all branches and tags) along
-with the object name that each reference currently points to.
-
- $ echo -e -n "0044git-upload-pack /schacon/gitbook.git\0host=example.com\0\0version=1\0" |
- nc -v example.com 9418
- 000aversion 1
- 00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack thin-pack
- side-band side-band-64k ofs-delta shallow no-progress include-tag
- 00441d3fcd5ced445d1abc402225c0b8a1299641f497 refs/heads/integration
- 003f7217a7c7e582c46cec22a130adf4b9d7d950fba0 refs/heads/master
- 003cb88d2441cac0977faf98efc80305012112238d9d refs/tags/v0.9
- 003c525128480b96c89e6418b1e40909bf6c5b2d580f refs/tags/v1.0
- 003fe92df48743b7bc7d26bcaabfddde0a1e20cae47c refs/tags/v1.0^{}
- 0000
-
-The returned response is a pkt-line stream describing each ref and
-its current value. The stream MUST be sorted by name according to
-the C locale ordering.
-
-If HEAD is a valid ref, HEAD MUST appear as the first advertised
-ref. If HEAD is not a valid ref, HEAD MUST NOT appear in the
-advertisement list at all, but other refs may still appear.
-
-The stream MUST include capability declarations behind a NUL on the
-first ref. The peeled value of a ref (that is "ref^{}") MUST be
-immediately after the ref itself, if presented. A conforming server
-MUST peel the ref if it's an annotated tag.
-
-----
- advertised-refs = *1("version 1")
- (no-refs / list-of-refs)
- *shallow
- flush-pkt
-
- no-refs = PKT-LINE(zero-id SP "capabilities^{}"
- NUL capability-list)
-
- list-of-refs = first-ref *other-ref
- first-ref = PKT-LINE(obj-id SP refname
- NUL capability-list)
-
- other-ref = PKT-LINE(other-tip / other-peeled)
- other-tip = obj-id SP refname
- other-peeled = obj-id SP refname "^{}"
-
- shallow = PKT-LINE("shallow" SP obj-id)
-
- capability-list = capability *(SP capability)
- capability = 1*(LC_ALPHA / DIGIT / "-" / "_")
- LC_ALPHA = %x61-7A
-----
-
-Server and client MUST use lowercase for obj-id, both MUST treat obj-id
-as case-insensitive.
-
-See protocol-capabilities.txt for a list of allowed server capabilities
-and descriptions.
-
-Packfile Negotiation
---------------------
-After reference and capabilities discovery, the client can decide to
-terminate the connection by sending a flush-pkt, telling the server it can
-now gracefully terminate, and disconnect, when it does not need any pack
-data. This can happen with the ls-remote command, and also can happen when
-the client already is up to date.
-
-Otherwise, it enters the negotiation phase, where the client and
-server determine what the minimal packfile necessary for transport is,
-by telling the server what objects it wants, its shallow objects
-(if any), and the maximum commit depth it wants (if any). The client
-will also send a list of the capabilities it wants to be in effect,
-out of what the server said it could do with the first 'want' line.
-
-----
- upload-request = want-list
- *shallow-line
- *1depth-request
- [filter-request]
- flush-pkt
-
- want-list = first-want
- *additional-want
-
- shallow-line = PKT-LINE("shallow" SP obj-id)
-
- depth-request = PKT-LINE("deepen" SP depth) /
- PKT-LINE("deepen-since" SP timestamp) /
- PKT-LINE("deepen-not" SP ref)
-
- first-want = PKT-LINE("want" SP obj-id SP capability-list)
- additional-want = PKT-LINE("want" SP obj-id)
-
- depth = 1*DIGIT
-
- filter-request = PKT-LINE("filter" SP filter-spec)
-----
-
-Clients MUST send all the obj-ids it wants from the reference
-discovery phase as 'want' lines. Clients MUST send at least one
-'want' command in the request body. Clients MUST NOT mention an
-obj-id in a 'want' command which did not appear in the response
-obtained through ref discovery.
-
-The client MUST write all obj-ids which it only has shallow copies
-of (meaning that it does not have the parents of a commit) as
-'shallow' lines so that the server is aware of the limitations of
-the client's history.
-
-The client now sends the maximum commit history depth it wants for
-this transaction, which is the number of commits it wants from the
-tip of the history, if any, as a 'deepen' line. A depth of 0 is the
-same as not making a depth request. The client does not want to receive
-any commits beyond this depth, nor does it want objects needed only to
-complete those commits. Commits whose parents are not received as a
-result are defined as shallow and marked as such in the server. This
-information is sent back to the client in the next step.
-
-The client can optionally request that pack-objects omit various
-objects from the packfile using one of several filtering techniques.
-These are intended for use with partial clone and partial fetch
-operations. An object that does not meet a filter-spec value is
-omitted unless explicitly requested in a 'want' line. See `rev-list`
-for possible filter-spec values.
-
-Once all the 'want's and 'shallow's (and optional 'deepen') are
-transferred, clients MUST send a flush-pkt, to tell the server side
-that it is done sending the list.
-
-Otherwise, if the client sent a positive depth request, the server
-will determine which commits will and will not be shallow and
-send this information to the client. If the client did not request
-a positive depth, this step is skipped.
-
-----
- shallow-update = *shallow-line
- *unshallow-line
- flush-pkt
-
- shallow-line = PKT-LINE("shallow" SP obj-id)
-
- unshallow-line = PKT-LINE("unshallow" SP obj-id)
-----
-
-If the client has requested a positive depth, the server will compute
-the set of commits which are no deeper than the desired depth. The set
-of commits start at the client's wants.
-
-The server writes 'shallow' lines for each
-commit whose parents will not be sent as a result. The server writes
-an 'unshallow' line for each commit which the client has indicated is
-shallow, but is no longer shallow at the currently requested depth
-(that is, its parents will now be sent). The server MUST NOT mark
-as unshallow anything which the client has not indicated was shallow.
-
-Now the client will send a list of the obj-ids it has using 'have'
-lines, so the server can make a packfile that only contains the objects
-that the client needs. In multi_ack mode, the canonical implementation
-will send up to 32 of these at a time, then will send a flush-pkt. The
-canonical implementation will skip ahead and send the next 32 immediately,
-so that there is always a block of 32 "in-flight on the wire" at a time.
-
-----
- upload-haves = have-list
- compute-end
-
- have-list = *have-line
- have-line = PKT-LINE("have" SP obj-id)
- compute-end = flush-pkt / PKT-LINE("done")
-----
-
-If the server reads 'have' lines, it then will respond by ACKing any
-of the obj-ids the client said it had that the server also has. The
-server will ACK obj-ids differently depending on which ack mode is
-chosen by the client.
-
-In multi_ack mode:
-
- * the server will respond with 'ACK obj-id continue' for any common
- commits.
-
- * once the server has found an acceptable common base commit and is
- ready to make a packfile, it will blindly ACK all 'have' obj-ids
- back to the client.
-
- * the server will then send a 'NAK' and then wait for another response
- from the client - either a 'done' or another list of 'have' lines.
-
-In multi_ack_detailed mode:
-
- * the server will differentiate the ACKs where it is signaling
- that it is ready to send data with 'ACK obj-id ready' lines, and
- signals the identified common commits with 'ACK obj-id common' lines.
-
-Without either multi_ack or multi_ack_detailed:
-
- * upload-pack sends "ACK obj-id" on the first common object it finds.
- After that it says nothing until the client gives it a "done".
-
- * upload-pack sends "NAK" on a flush-pkt if no common object
- has been found yet. If one has been found, and thus an ACK
- was already sent, it's silent on the flush-pkt.
-
-After the client has gotten enough ACK responses that it can determine
-that the server has enough information to send an efficient packfile
-(in the canonical implementation, this is determined when it has received
-enough ACKs that it can color everything left in the --date-order queue
-as common with the server, or the --date-order queue is empty), or the
-client determines that it wants to give up (in the canonical implementation,
-this is determined when the client sends 256 'have' lines without getting
-any of them ACKed by the server - meaning there is nothing in common and
-the server should just send all of its objects), then the client will send
-a 'done' command. The 'done' command signals to the server that the client
-is ready to receive its packfile data.
-
-However, the 256 limit *only* turns on in the canonical client
-implementation if we have received at least one "ACK %s continue"
-during a prior round. This helps to ensure that at least one common
-ancestor is found before we give up entirely.
-
-Once the 'done' line is read from the client, the server will either
-send a final 'ACK obj-id' or it will send a 'NAK'. 'obj-id' is the object
-name of the last commit determined to be common. The server only sends
-ACK after 'done' if there is at least one common base and multi_ack or
-multi_ack_detailed is enabled. The server always sends NAK after 'done'
-if there is no common base found.
-
-Instead of 'ACK' or 'NAK', the server may send an error message (for
-example, if it does not recognize an object in a 'want' line received
-from the client).
-
-Then the server will start sending its packfile data.
-
-----
- server-response = *ack_multi ack / nak
- ack_multi = PKT-LINE("ACK" SP obj-id ack_status)
- ack_status = "continue" / "common" / "ready"
- ack = PKT-LINE("ACK" SP obj-id)
- nak = PKT-LINE("NAK")
-----
-
-A simple clone may look like this (with no 'have' lines):
-
-----
- C: 0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack \
- side-band-64k ofs-delta\n
- C: 0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\n
- C: 0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\n
- C: 0032want 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\n
- C: 0032want 74730d410fcb6603ace96f1dc55ea6196122532d\n
- C: 0000
- C: 0009done\n
-
- S: 0008NAK\n
- S: [PACKFILE]
-----
-
-An incremental update (fetch) response might look like this:
-
-----
- C: 0054want 74730d410fcb6603ace96f1dc55ea6196122532d multi_ack \
- side-band-64k ofs-delta\n
- C: 0032want 7d1665144a3a975c05f1f43902ddaf084e784dbe\n
- C: 0032want 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a\n
- C: 0000
- C: 0032have 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01\n
- C: [30 more have lines]
- C: 0032have 74730d410fcb6603ace96f1dc55ea6196122532d\n
- C: 0000
-
- S: 003aACK 7e47fe2bd8d01d481f44d7af0531bd93d3b21c01 continue\n
- S: 003aACK 74730d410fcb6603ace96f1dc55ea6196122532d continue\n
- S: 0008NAK\n
-
- C: 0009done\n
-
- S: 0031ACK 74730d410fcb6603ace96f1dc55ea6196122532d\n
- S: [PACKFILE]
-----
-
-
-Packfile Data
--------------
-
-Now that the client and server have finished negotiation about what
-the minimal amount of data that needs to be sent to the client is, the server
-will construct and send the required data in packfile format.
-
-See pack-format.txt for what the packfile itself actually looks like.
-
-If 'side-band' or 'side-band-64k' capabilities have been specified by
-the client, the server will send the packfile data multiplexed.
-
-Each packet starting with the packet-line length of the amount of data
-that follows, followed by a single byte specifying the sideband the
-following data is coming in on.
-
-In 'side-band' mode, it will send up to 999 data bytes plus 1 control
-code, for a total of up to 1000 bytes in a pkt-line. In 'side-band-64k'
-mode it will send up to 65519 data bytes plus 1 control code, for a
-total of up to 65520 bytes in a pkt-line.
-
-The sideband byte will be a '1', '2' or a '3'. Sideband '1' will contain
-packfile data, sideband '2' will be used for progress information that the
-client will generally print to stderr and sideband '3' is used for error
-information.
-
-If no 'side-band' capability was specified, the server will stream the
-entire packfile without multiplexing.
-
-
-Pushing Data To a Server
-------------------------
-
-Pushing data to a server will invoke the 'receive-pack' process on the
-server, which will allow the client to tell it which references it should
-update and then send all the data the server will need for those new
-references to be complete. Once all the data is received and validated,
-the server will then update its references to what the client specified.
-
-Authentication
---------------
-
-The protocol itself contains no authentication mechanisms. That is to be
-handled by the transport, such as SSH, before the 'receive-pack' process is
-invoked. If 'receive-pack' is configured over the Git transport, those
-repositories will be writable by anyone who can access that port (9418) as
-that transport is unauthenticated.
-
-Reference Discovery
--------------------
-
-The reference discovery phase is done nearly the same way as it is in the
-fetching protocol. Each reference obj-id and name on the server is sent
-in packet-line format to the client, followed by a flush-pkt. The only
-real difference is that the capability listing is different - the only
-possible values are 'report-status', 'delete-refs', 'ofs-delta' and
-'push-options'.
-
-Reference Update Request and Packfile Transfer
-----------------------------------------------
-
-Once the client knows what references the server is at, it can send a
-list of reference update requests. For each reference on the server
-that it wants to update, it sends a line listing the obj-id currently on
-the server, the obj-id the client would like to update it to and the name
-of the reference.
-
-This list is followed by a flush-pkt.
-
-----
- update-requests = *shallow ( command-list | push-cert )
-
- shallow = PKT-LINE("shallow" SP obj-id)
-
- command-list = PKT-LINE(command NUL capability-list)
- *PKT-LINE(command)
- flush-pkt
-
- command = create / delete / update
- create = zero-id SP new-id SP name
- delete = old-id SP zero-id SP name
- update = old-id SP new-id SP name
-
- old-id = obj-id
- new-id = obj-id
-
- push-cert = PKT-LINE("push-cert" NUL capability-list LF)
- PKT-LINE("certificate version 0.1" LF)
- PKT-LINE("pusher" SP ident LF)
- PKT-LINE("pushee" SP url LF)
- PKT-LINE("nonce" SP nonce LF)
- *PKT-LINE("push-option" SP push-option LF)
- PKT-LINE(LF)
- *PKT-LINE(command LF)
- *PKT-LINE(gpg-signature-lines LF)
- PKT-LINE("push-cert-end" LF)
-
- push-option = 1*( VCHAR | SP )
-----
-
-If the server has advertised the 'push-options' capability and the client has
-specified 'push-options' as part of the capability list above, the client then
-sends its push options followed by a flush-pkt.
-
-----
- push-options = *PKT-LINE(push-option) flush-pkt
-----
-
-For backwards compatibility with older Git servers, if the client sends a push
-cert and push options, it MUST send its push options both embedded within the
-push cert and after the push cert. (Note that the push options within the cert
-are prefixed, but the push options after the cert are not.) Both these lists
-MUST be the same, modulo the prefix.
-
-After that the packfile that
-should contain all the objects that the server will need to complete the new
-references will be sent.
-
-----
- packfile = "PACK" 28*(OCTET)
-----
-
-If the receiving end does not support delete-refs, the sending end MUST
-NOT ask for delete command.
-
-If the receiving end does not support push-cert, the sending end
-MUST NOT send a push-cert command. When a push-cert command is
-sent, command-list MUST NOT be sent; the commands recorded in the
-push certificate is used instead.
-
-The packfile MUST NOT be sent if the only command used is 'delete'.
-
-A packfile MUST be sent if either create or update command is used,
-even if the server already has all the necessary objects. In this
-case the client MUST send an empty packfile. The only time this
-is likely to happen is if the client is creating
-a new branch or a tag that points to an existing obj-id.
-
-The server will receive the packfile, unpack it, then validate each
-reference that is being updated that it hasn't changed while the request
-was being processed (the obj-id is still the same as the old-id), and
-it will run any update hooks to make sure that the update is acceptable.
-If all of that is fine, the server will then update the references.
-
-Push Certificate
-----------------
-
-A push certificate begins with a set of header lines. After the
-header and an empty line, the protocol commands follow, one per
-line. Note that the trailing LF in push-cert PKT-LINEs is _not_
-optional; it must be present.
-
-Currently, the following header fields are defined:
-
-`pusher` ident::
- Identify the GPG key in "Human Readable Name <email@address>"
- format.
-
-`pushee` url::
- The repository URL (anonymized, if the URL contains
- authentication material) the user who ran `git push`
- intended to push into.
-
-`nonce` nonce::
- The 'nonce' string the receiving repository asked the
- pushing user to include in the certificate, to prevent
- replay attacks.
-
-The GPG signature lines are a detached signature for the contents
-recorded in the push certificate before the signature block begins.
-The detached signature is used to certify that the commands were
-given by the pusher, who must be the signer.
-
-Report Status
--------------
-
-After receiving the pack data from the sender, the receiver sends a
-report if 'report-status' capability is in effect.
-It is a short listing of what happened in that update. It will first
-list the status of the packfile unpacking as either 'unpack ok' or
-'unpack [error]'. Then it will list the status for each of the references
-that it tried to update. Each line is either 'ok [refname]' if the
-update was successful, or 'ng [refname] [error]' if the update was not.
-
-----
- report-status = unpack-status
- 1*(command-status)
- flush-pkt
-
- unpack-status = PKT-LINE("unpack" SP unpack-result)
- unpack-result = "ok" / error-msg
-
- command-status = command-ok / command-fail
- command-ok = PKT-LINE("ok" SP refname)
- command-fail = PKT-LINE("ng" SP refname SP error-msg)
-
- error-msg = 1*(OCTECT) ; where not "ok"
-----
-
-Updates can be unsuccessful for a number of reasons. The reference can have
-changed since the reference discovery phase was originally sent, meaning
-someone pushed in the meantime. The reference being pushed could be a
-non-fast-forward reference and the update hooks or configuration could be
-set to not allow that, etc. Also, some references can be updated while others
-can be rejected.
-
-An example client/server communication might look like this:
-
-----
- S: 006274730d410fcb6603ace96f1dc55ea6196122532d refs/heads/local\0report-status delete-refs ofs-delta\n
- S: 003e7d1665144a3a975c05f1f43902ddaf084e784dbe refs/heads/debug\n
- S: 003f74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/master\n
- S: 003d74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/team\n
- S: 0000
-
- C: 00677d1665144a3a975c05f1f43902ddaf084e784dbe 74730d410fcb6603ace96f1dc55ea6196122532d refs/heads/debug\n
- C: 006874730d410fcb6603ace96f1dc55ea6196122532d 5a3f6be755bbb7deae50065988cbfa1ffa9ab68a refs/heads/master\n
- C: 0000
- C: [PACKDATA]
-
- S: 000eunpack ok\n
- S: 0018ok refs/heads/debug\n
- S: 002ang refs/heads/master non-fast-forward\n
-----
diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
new file mode 100644
index 0000000..9d453d4
--- /dev/null
+++ b/Documentation/technical/packfile-uri.txt
@@ -0,0 +1,82 @@
+Packfile URIs
+=============
+
+This feature allows servers to serve part of their packfile response as URIs.
+This allows server designs that improve scalability in bandwidth and CPU usage
+(for example, by serving some data through a CDN), and (in the future) provides
+some measure of resumability to clients.
+
+This feature is available only in protocol version 2.
+
+Protocol
+--------
+
+The server advertises the `packfile-uris` capability.
+
+If the client then communicates which protocols (HTTPS, etc.) it supports with
+a `packfile-uris` argument, the server MAY send a `packfile-uris` section
+directly before the `packfile` section (right after `wanted-refs` if it is
+sent) containing URIs of any of the given protocols. The URIs point to
+packfiles that use only features that the client has declared that it supports
+(e.g. ofs-delta and thin-pack). See linkgit:gitprotocol-v2[5] for the documentation of
+this section.
+
+Clients should then download and index all the given URIs (in addition to
+downloading and indexing the packfile given in the `packfile` section of the
+response) before performing the connectivity check.
+
+Server design
+-------------
+
+The server can be trivially made compatible with the proposed protocol by
+having it advertise `packfile-uris`, tolerating the client sending
+`packfile-uris`, and never sending any `packfile-uris` section. But we should
+include some sort of non-trivial implementation in the Minimum Viable Product,
+at least so that we can test the client.
+
+This is the implementation: a feature, marked experimental, that allows the
+server to be configured by one or more `uploadpack.blobPackfileUri=
+<object-hash> <pack-hash> <uri>` entries. Whenever the list of objects to be
+sent is assembled, all such blobs are excluded, replaced with URIs. As noted
+in "Future work" below, the server can evolve in the future to support
+excluding other objects (or other implementations of servers could be made
+that support excluding other objects) without needing a protocol change, so
+clients should not expect that packfiles downloaded in this way only contain
+single blobs.
+
+Client design
+-------------
+
+The client has a config variable `fetch.uriprotocols` that determines which
+protocols the end user is willing to use. By default, this is empty.
+
+When the client downloads the given URIs, it should store them with "keep"
+files, just like it does with the packfile in the `packfile` section. These
+additional "keep" files can only be removed after the refs have been updated -
+just like the "keep" file for the packfile in the `packfile` section.
+
+The division of work (initial fetch + additional URIs) introduces convenient
+points for resumption of an interrupted clone - such resumption can be done
+after the Minimum Viable Product (see "Future work").
+
+Future work
+-----------
+
+The protocol design allows some evolution of the server and client without any
+need for protocol changes, so only a small-scoped design is included here to
+form the MVP. For example, the following can be done:
+
+ * On the server, more sophisticated means of excluding objects (e.g. by
+ specifying a commit to represent that commit and all objects that it
+ references).
+ * On the client, resumption of clone. If a clone is interrupted, information
+ could be recorded in the repository's config and a "clone-resume" command
+ can resume the clone in progress. (Resumption of subsequent fetches is more
+ difficult because that must deal with the user wanting to use the repository
+ even after the fetch was interrupted.)
+
+There are some possible features that will require a change in protocol:
+
+ * Additional HTTP headers (e.g. authentication)
+ * Byte range support
+ * Different file formats referenced by URIs (e.g. raw object)
diff --git a/Documentation/technical/parallel-checkout.txt b/Documentation/technical/parallel-checkout.txt
new file mode 100644
index 0000000..b4a144e
--- /dev/null
+++ b/Documentation/technical/parallel-checkout.txt
@@ -0,0 +1,270 @@
+Parallel Checkout Design Notes
+==============================
+
+The "Parallel Checkout" feature attempts to use multiple processes to
+parallelize the work of uncompressing the blobs, applying in-core
+filters, and writing the resulting contents to the working tree during a
+checkout operation. It can be used by all checkout-related commands,
+such as `clone`, `checkout`, `reset`, `sparse-checkout`, and others.
+
+These commands share the following basic structure:
+
+* Step 1: Read the current index file into memory.
+
+* Step 2: Modify the in-memory index based upon the command, and
+ temporarily mark all cache entries that need to be updated.
+
+* Step 3: Populate the working tree to match the new candidate index.
+ This includes iterating over all of the to-be-updated cache entries
+ and delete, create, or overwrite the associated files in the working
+ tree.
+
+* Step 4: Write the new index to disk.
+
+Step 3 is the focus of the "parallel checkout" effort described here.
+
+Sequential Implementation
+-------------------------
+
+For the purposes of discussion here, the current sequential
+implementation of Step 3 is divided in 3 parts, each one implemented in
+its own function:
+
+* Step 3a: `unpack-trees.c:check_updates()` contains a series of
+ sequential loops iterating over the `cache_entry`'s array. The main
+ loop in this function calls the Step 3b function for each of the
+ to-be-updated entries.
+
+* Step 3b: `entry.c:checkout_entry()` examines the existing working tree
+ for file conflicts, collisions, and unsaved changes. It removes files
+ and creates leading directories as necessary. It calls the Step 3c
+ function for each entry to be written.
+
+* Step 3c: `entry.c:write_entry()` loads the blob into memory, smudges
+ it if necessary, creates the file in the working tree, writes the
+ smudged contents, calls `fstat()` or `lstat()`, and updates the
+ associated `cache_entry` struct with the stat information gathered.
+
+It wouldn't be safe to perform Step 3b in parallel, as there could be
+race conditions between file creations and removals. Instead, the
+parallel checkout framework lets the sequential code handle Step 3b,
+and uses parallel workers to replace the sequential
+`entry.c:write_entry()` calls from Step 3c.
+
+Rejected Multi-Threaded Solution
+--------------------------------
+
+The most "straightforward" implementation would be to spread the set of
+to-be-updated cache entries across multiple threads. But due to the
+thread-unsafe functions in the object database code, we would have to use locks to
+coordinate the parallel operation. An early prototype of this solution
+showed that the multi-threaded checkout would bring performance
+improvements over the sequential code, but there was still too much lock
+contention. A `perf` profiling indicated that around 20% of the runtime
+during a local Linux clone (on an SSD) was spent in locking functions.
+For this reason this approach was rejected in favor of using multiple
+child processes, which led to better performance.
+
+Multi-Process Solution
+----------------------
+
+Parallel checkout alters the aforementioned Step 3 to use multiple
+`checkout--worker` background processes to distribute the work. The
+long-running worker processes are controlled by the foreground Git
+command using the existing run-command API.
+
+Overview
+~~~~~~~~
+
+Step 3b is only slightly altered; for each entry to be checked out, the
+main process performs the following steps:
+
+* M1: Check whether there is any untracked or unclean file in the
+ working tree which would be overwritten by this entry, and decide
+ whether to proceed (removing the file(s)) or not.
+
+* M2: Create the leading directories.
+
+* M3: Load the conversion attributes for the entry's path.
+
+* M4: Check, based on the entry's type and conversion attributes,
+ whether the entry is eligible for parallel checkout (more on this
+ later). If it is eligible, enqueue the entry and the loaded
+ attributes to later write the entry in parallel. If not, write the
+ entry right away, using the default sequential code.
+
+Note: we save the conversion attributes associated with each entry
+because the workers don't have access to the main process' index state,
+so they can't load the attributes by themselves (and the attributes are
+needed to properly smudge the entry). Additionally, this has a positive
+impact on performance as (1) we don't need to load the attributes twice
+and (2) the attributes machinery is optimized to handle paths in
+sequential order.
+
+After all entries have passed through the above steps, the main process
+checks if the number of enqueued entries is sufficient to spread among
+the workers. If not, it just writes them sequentially. Otherwise, it
+spawns the workers and distributes the queued entries uniformly in
+continuous chunks. This aims to minimize the chances of two workers
+writing to the same directory simultaneously, which could increase lock
+contention in the kernel.
+
+Then, for each assigned item, each worker:
+
+* W1: Checks if there is any non-directory file in the leading part of
+ the entry's path or if there already exists a file at the entry' path.
+ If so, mark the entry with `PC_ITEM_COLLIDED` and skip it (more on
+ this later).
+
+* W2: Creates the file (with O_CREAT and O_EXCL).
+
+* W3: Loads the blob into memory (inflating and delta reconstructing
+ it).
+
+* W4: Applies any required in-process filter, like end-of-line
+ conversion and re-encoding.
+
+* W5: Writes the result to the file descriptor opened at W2.
+
+* W6: Calls `fstat()` or `lstat()` on the just-written path, and sends
+ the result back to the main process, together with the end status of
+ the operation and the item's identification number.
+
+Note that, when possible, steps W3 to W5 are delegated to the streaming
+machinery, removing the need to keep the entire blob in memory.
+
+If the worker fails to read the blob or to write it to the working tree,
+it removes the created file to avoid leaving empty files behind. This is
+the *only* time a worker is allowed to remove a file.
+
+As mentioned earlier, it is the responsibility of the main process to
+remove any file that blocks the checkout operation (or abort if the
+removal(s) would cause data loss and the user didn't ask to `--force`).
+This is crucial to avoid race conditions and also to properly detect
+path collisions at Step W1.
+
+After the workers finish writing the items and sending back the required
+information, the main process handles the results in two steps:
+
+- First, it updates the in-memory index with the `lstat()` information
+ sent by the workers. (This must be done first as this information
+ might be required in the following step.)
+
+- Then it writes the items which collided on disk (i.e. items marked
+ with `PC_ITEM_COLLIDED`). More on this below.
+
+Path Collisions
+---------------
+
+Path collisions happen when two different paths correspond to the same
+entry in the file system. E.g. the paths 'a' and 'A' would collide in a
+case-insensitive file system.
+
+The sequential checkout deals with collisions in the same way that it
+deals with files that were already present in the working tree before
+checkout. Basically, it checks if the path that it wants to write
+already exists on disk, makes sure the existing file doesn't have
+unsaved data, and then overwrites it. (To be more pedantic: it deletes
+the existing file and creates the new one.) So, if there are multiple
+colliding files to be checked out, the sequential code will write each
+one of them but only the last will actually survive on disk.
+
+Parallel checkout aims to reproduce the same behavior. However, we
+cannot let the workers racily write to the same file on disk. Instead,
+the workers detect when the entry that they want to check out would
+collide with an existing file, and mark it with `PC_ITEM_COLLIDED`.
+Later, the main process can sequentially feed these entries back to
+`checkout_entry()` without the risk of race conditions. On clone, this
+also has the effect of marking the colliding entries to later emit a
+warning for the user, like the classic sequential checkout does.
+
+The workers are able to detect both collisions among the entries being
+concurrently written and collisions between a parallel-eligible entry
+and an ineligible entry. The general idea for collision detection is
+quite straightforward: for each parallel-eligible entry, the main
+process must remove all files that prevent this entry from being written
+(before enqueueing it). This includes any non-directory file in the
+leading path of the entry. Later, when a worker gets assigned the entry,
+it looks again for the non-directory files and for an already existing
+file at the entry's path. If any of these checks finds something, the
+worker knows that there was a path collision.
+
+Because parallel checkout can distinguish path collisions from the case
+where the file was already present in the working tree before checkout,
+we could alternatively choose to skip the checkout of colliding entries.
+However, each entry that doesn't get written would have NULL `lstat()`
+fields on the index. This could cause performance penalties for
+subsequent commands that need to refresh the index, as they would have
+to go to the file system to see if the entry is dirty. Thus, if we have
+N entries in a colliding group and we decide to write and `lstat()` only
+one of them, every subsequent `git-status` will have to read, convert,
+and hash the written file N - 1 times. By checking out all colliding
+entries (like the sequential code does), we only pay the overhead once,
+during checkout.
+
+Eligible Entries for Parallel Checkout
+--------------------------------------
+
+As previously mentioned, not all entries passed to `checkout_entry()`
+will be considered eligible for parallel checkout. More specifically, we
+exclude:
+
+- Symbolic links; to avoid race conditions that, in combination with
+ path collisions, could cause workers to write files at the wrong
+ place. For example, if we were to concurrently check out a symlink
+ 'a' -> 'b' and a regular file 'A/f' in a case-insensitive file system,
+ we could potentially end up writing the file 'A/f' at 'a/f', due to a
+ race condition.
+
+- Regular files that require external filters (either "one shot" filters
+ or long-running process filters). These filters are black-boxes to Git
+ and may have their own internal locking or non-concurrent assumptions.
+ So it might not be safe to run multiple instances in parallel.
++
+Besides, long-running filters may use the delayed checkout feature to
+postpone the return of some filtered blobs. The delayed checkout queue
+and the parallel checkout queue are not compatible and should remain
+separate.
++
+Note: regular files that only require internal filters, like end-of-line
+conversion and re-encoding, are eligible for parallel checkout.
+
+Ineligible entries are checked out by the classic sequential codepath
+*before* spawning workers.
+
+Note: submodules' files are also eligible for parallel checkout (as
+long as they don't fall into any of the excluding categories mentioned
+above). But since each submodule is checked out in its own child
+process, we don't mix the superproject's and the submodules' files in
+the same parallel checkout process or queue.
+
+The API
+-------
+
+The parallel checkout API was designed with the goal of minimizing
+changes to the current users of the checkout machinery. This means that
+they don't have to call a different function for sequential or parallel
+checkout. As already mentioned, `checkout_entry()` will automatically
+insert the given entry in the parallel checkout queue when this feature
+is enabled and the entry is eligible; otherwise, it will just write the
+entry right away, using the sequential code. In general, callers of the
+parallel checkout API should look similar to this:
+
+----------------------------------------------
+int pc_workers, pc_threshold, err = 0;
+struct checkout state;
+
+get_parallel_checkout_configs(&pc_workers, &pc_threshold);
+
+/*
+ * This check is not strictly required, but it
+ * should save some time in sequential mode.
+ */
+if (pc_workers > 1)
+ init_parallel_checkout();
+
+for (each cache_entry ce to-be-updated)
+ err |= checkout_entry(ce, &state, NULL, NULL);
+
+err |= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL);
+----------------------------------------------
diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
index 210373e..cd948b0 100644
--- a/Documentation/technical/partial-clone.txt
+++ b/Documentation/technical/partial-clone.txt
@@ -3,7 +3,7 @@ Partial Clone Design Notes
The "Partial Clone" feature is a performance optimization for Git that
allows Git to function without having a complete copy of the repository.
-The goal of this work is to allow Git better handle extremely large
+The goal of this work is to allow Git to better handle extremely large
repositories.
During clone and fetch operations, Git downloads the complete contents
@@ -32,7 +32,7 @@ if/when needed.
A remote that can later provide the missing objects is called a
promisor remote, as it promises to send the objects when
-requested. Initialy Git supported only one promisor remote, the origin
+requested. Initially Git supported only one promisor remote, the origin
remote from which the user cloned and that was configured in the
"extensions.partialClone" config option. Later support for more than
one promisor remote has been implemented.
@@ -79,7 +79,7 @@ Design Details
upload-pack negotiation.
+
This uses the existing capability discovery mechanism.
-See "filter" in Documentation/technical/pack-protocol.txt.
+See "filter" in linkgit:gitprotocol-pack[5].
- Clients pass a "filter-spec" to clone and fetch which is passed to the
server to request filtering during packfile construction.
@@ -171,23 +171,19 @@ additional flag.
Fetching Missing Objects
------------------------
-- Fetching of objects is done using the existing transport mechanism using
- transport_fetch_refs(), setting a new transport option
- TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
- desired, not any object that they refer to.
-+
-Because some transports invoke fetch_pack() in the same process, fetch_pack()
-has been updated to not use any object flags when the corresponding argument
-(no_dependents) is set.
+- Fetching of objects is done by invoking a "git fetch" subprocess.
- The local repository sends a request with the hashes of all requested
- objects as "want" lines, and does not perform any packfile negotiation.
+ objects, and does not perform any packfile negotiation.
It then receives a packfile.
-- Because we are reusing the existing fetch-pack mechanism, fetching
+- Because we are reusing the existing fetch mechanism, fetching
currently fetches all objects referred to by the requested objects, even
though they are not necessary.
+- Fetching with `--refetch` will request a complete new filtered packfile from
+ the remote, which can be used to change a filter without needing to
+ dynamically fetch missing objects.
Using many promisor remotes
---------------------------
@@ -249,8 +245,7 @@ remote in a specific order.
repository and can satisfy all such requests.
- Repack essentially treats promisor and non-promisor packfiles as 2
- distinct partitions and does not mix them. Repack currently only works
- on non-promisor packfiles and loose objects.
+ distinct partitions and does not mix them.
- Dynamic object fetching invokes fetch-pack once *for each item*
because most algorithms stumble upon a missing object and need to have
@@ -261,7 +256,7 @@ remote in a specific order.
- Dynamic object fetching currently uses the existing pack protocol V0
which means that each object is requested via fetch-pack. The server
will send a full set of info/refs when the connection is established.
- If there are large number of refs, this may incur significant overhead.
+ If there are a large number of refs, this may incur significant overhead.
Future Work
@@ -270,7 +265,7 @@ Future Work
- Improve the way to specify the order in which promisor remotes are
tried.
+
-For example this could allow to specify explicitly something like:
+For example this could allow specifying explicitly something like:
"When fetching from this remote, I want to use these promisor remotes
in this order, though, when pushing or fetching to that remote, I want
to use those promisor remotes in that order."
@@ -280,9 +275,6 @@ to use those promisor remotes in that order."
The user might want to work in a triangular work flow with multiple
promisor remotes that each have an incomplete view of the repository.
-- Allow repack to work on promisor packfiles (while keeping them distinct
- from non-promisor packfiles).
-
- Allow non-pathname-based filters to make use of packfile bitmaps (when
present). This was just an omission during the initial implementation.
@@ -330,7 +322,7 @@ Footnotes
[a] expensive-to-modify list of missing objects: Earlier in the design of
partial clone we discussed the need for a single list of missing objects.
- This would essentially be a sorted linear list of OIDs that the were
+ This would essentially be a sorted linear list of OIDs that were
omitted by the server during a clone or subsequent fetches.
This file would need to be loaded into memory on every object lookup.
@@ -350,26 +342,26 @@ Related Links
[0] https://crbug.com/git/2
Bug#2: Partial Clone
-[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/ +
+[1] https://lore.kernel.org/git/20170113155253.1644-1-benpeart@microsoft.com/ +
Subject: [RFC] Add support for downloading blobs on demand +
Date: Fri, 13 Jan 2017 10:52:53 -0500
-[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/ +
+[2] https://lore.kernel.org/git/cover.1506714999.git.jonathantanmy@google.com/ +
Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) +
Date: Fri, 29 Sep 2017 13:11:36 -0700
-[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/ +
+[3] https://lore.kernel.org/git/20170426221346.25337-1-jonathantanmy@google.com/ +
Subject: Proposal for missing blob support in Git repos +
Date: Wed, 26 Apr 2017 15:13:46 -0700
-[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ +
+[4] https://lore.kernel.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ +
Subject: [PATCH 00/10] RFC Partial Clone and Fetch +
Date: Wed, 8 Mar 2017 18:50:29 +0000
-[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/ +
+[5] https://lore.kernel.org/git/20170505152802.6724-1-benpeart@microsoft.com/ +
Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module +
Date: Fri, 5 May 2017 11:27:52 -0400
-[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
+[6] https://lore.kernel.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand +
Date: Fri, 14 Jul 2017 09:26:50 -0400
diff --git a/Documentation/technical/protocol-capabilities.txt b/Documentation/technical/protocol-capabilities.txt
deleted file mode 100644
index 2b267c0..0000000
--- a/Documentation/technical/protocol-capabilities.txt
+++ /dev/null
@@ -1,337 +0,0 @@
-Git Protocol Capabilities
-=========================
-
-NOTE: this document describes capabilities for versions 0 and 1 of the pack
-protocol. For version 2, please refer to the link:protocol-v2.html[protocol-v2]
-doc.
-
-Servers SHOULD support all capabilities defined in this document.
-
-On the very first line of the initial server response of either
-receive-pack and upload-pack the first reference is followed by
-a NUL byte and then a list of space delimited server capabilities.
-These allow the server to declare what it can and cannot support
-to the client.
-
-Client will then send a space separated list of capabilities it wants
-to be in effect. The client MUST NOT ask for capabilities the server
-did not say it supports.
-
-Server MUST diagnose and abort if capabilities it does not understand
-was sent. Server MUST NOT ignore capabilities that client requested
-and server advertised. As a consequence of these rules, server MUST
-NOT advertise capabilities it does not understand.
-
-The 'atomic', 'report-status', 'delete-refs', 'quiet', and 'push-cert'
-capabilities are sent and recognized by the receive-pack (push to server)
-process.
-
-The 'ofs-delta' and 'side-band-64k' capabilities are sent and recognized
-by both upload-pack and receive-pack protocols. The 'agent' capability
-may optionally be sent in both protocols.
-
-All other capabilities are only recognized by the upload-pack (fetch
-from server) process.
-
-multi_ack
----------
-
-The 'multi_ack' capability allows the server to return "ACK obj-id
-continue" as soon as it finds a commit that it can use as a common
-base, between the client's wants and the client's have set.
-
-By sending this early, the server can potentially head off the client
-from walking any further down that particular branch of the client's
-repository history. The client may still need to walk down other
-branches, sending have lines for those, until the server has a
-complete cut across the DAG, or the client has said "done".
-
-Without multi_ack, a client sends have lines in --date-order until
-the server has found a common base. That means the client will send
-have lines that are already known by the server to be common, because
-they overlap in time with another branch that the server hasn't found
-a common base on yet.
-
-For example suppose the client has commits in caps that the server
-doesn't and the server has commits in lower case that the client
-doesn't, as in the following diagram:
-
- +---- u ---------------------- x
- / +----- y
- / /
- a -- b -- c -- d -- E -- F
- \
- +--- Q -- R -- S
-
-If the client wants x,y and starts out by saying have F,S, the server
-doesn't know what F,S is. Eventually the client says "have d" and
-the server sends "ACK d continue" to let the client know to stop
-walking down that line (so don't send c-b-a), but it's not done yet,
-it needs a base for x. The client keeps going with S-R-Q, until a
-gets reached, at which point the server has a clear base and it all
-ends.
-
-Without multi_ack the client would have sent that c-b-a chain anyway,
-interleaved with S-R-Q.
-
-multi_ack_detailed
-------------------
-This is an extension of multi_ack that permits client to better
-understand the server's in-memory state. See pack-protocol.txt,
-section "Packfile Negotiation" for more information.
-
-no-done
--------
-This capability should only be used with the smart HTTP protocol. If
-multi_ack_detailed and no-done are both present, then the sender is
-free to immediately send a pack following its first "ACK obj-id ready"
-message.
-
-Without no-done in the smart HTTP protocol, the server session would
-end and the client has to make another trip to send "done" before
-the server can send the pack. no-done removes the last round and
-thus slightly reduces latency.
-
-thin-pack
----------
-
-A thin pack is one with deltas which reference base objects not
-contained within the pack (but are known to exist at the receiving
-end). This can reduce the network traffic significantly, but it
-requires the receiving end to know how to "thicken" these packs by
-adding the missing bases to the pack.
-
-The upload-pack server advertises 'thin-pack' when it can generate
-and send a thin pack. A client requests the 'thin-pack' capability
-when it understands how to "thicken" it, notifying the server that
-it can receive such a pack. A client MUST NOT request the
-'thin-pack' capability if it cannot turn a thin pack into a
-self-contained pack.
-
-Receive-pack, on the other hand, is assumed by default to be able to
-handle thin packs, but can ask the client not to use the feature by
-advertising the 'no-thin' capability. A client MUST NOT send a thin
-pack if the server advertises the 'no-thin' capability.
-
-The reasons for this asymmetry are historical. The receive-pack
-program did not exist until after the invention of thin packs, so
-historically the reference implementation of receive-pack always
-understood thin packs. Adding 'no-thin' later allowed receive-pack
-to disable the feature in a backwards-compatible manner.
-
-
-side-band, side-band-64k
-------------------------
-
-This capability means that server can send, and client understand multiplexed
-progress reports and error info interleaved with the packfile itself.
-
-These two options are mutually exclusive. A modern client always
-favors 'side-band-64k'.
-
-Either mode indicates that the packfile data will be streamed broken
-up into packets of up to either 1000 bytes in the case of 'side_band',
-or 65520 bytes in the case of 'side_band_64k'. Each packet is made up
-of a leading 4-byte pkt-line length of how much data is in the packet,
-followed by a 1-byte stream code, followed by the actual data.
-
-The stream code can be one of:
-
- 1 - pack data
- 2 - progress messages
- 3 - fatal error message just before stream aborts
-
-The "side-band-64k" capability came about as a way for newer clients
-that can handle much larger packets to request packets that are
-actually crammed nearly full, while maintaining backward compatibility
-for the older clients.
-
-Further, with side-band and its up to 1000-byte messages, it's actually
-999 bytes of payload and 1 byte for the stream code. With side-band-64k,
-same deal, you have up to 65519 bytes of data and 1 byte for the stream
-code.
-
-The client MUST send only maximum of one of "side-band" and "side-
-band-64k". Server MUST diagnose it as an error if client requests
-both.
-
-ofs-delta
----------
-
-Server can send, and client understand PACKv2 with delta referring to
-its base by position in pack rather than by an obj-id. That is, they can
-send/read OBJ_OFS_DELTA (aka type 6) in a packfile.
-
-agent
------
-
-The server may optionally send a capability of the form `agent=X` to
-notify the client that the server is running version `X`. The client may
-optionally return its own agent string by responding with an `agent=Y`
-capability (but it MUST NOT do so if the server did not mention the
-agent capability). The `X` and `Y` strings may contain any printable
-ASCII characters except space (i.e., the byte range 32 < x < 127), and
-are typically of the form "package/version" (e.g., "git/1.8.3.1"). The
-agent strings are purely informative for statistics and debugging
-purposes, and MUST NOT be used to programmatically assume the presence
-or absence of particular features.
-
-symref
-------
-
-This parameterized capability is used to inform the receiver which symbolic ref
-points to which ref; for example, "symref=HEAD:refs/heads/master" tells the
-receiver that HEAD points to master. This capability can be repeated to
-represent multiple symrefs.
-
-Servers SHOULD include this capability for the HEAD symref if it is one of the
-refs being sent.
-
-Clients MAY use the parameters from this capability to select the proper initial
-branch when cloning a repository.
-
-shallow
--------
-
-This capability adds "deepen", "shallow" and "unshallow" commands to
-the fetch-pack/upload-pack protocol so clients can request shallow
-clones.
-
-deepen-since
-------------
-
-This capability adds "deepen-since" command to fetch-pack/upload-pack
-protocol so the client can request shallow clones that are cut at a
-specific time, instead of depth. Internally it's equivalent of doing
-"rev-list --max-age=<timestamp>" on the server side. "deepen-since"
-cannot be used with "deepen".
-
-deepen-not
-----------
-
-This capability adds "deepen-not" command to fetch-pack/upload-pack
-protocol so the client can request shallow clones that are cut at a
-specific revision, instead of depth. Internally it's equivalent of
-doing "rev-list --not <rev>" on the server side. "deepen-not"
-cannot be used with "deepen", but can be used with "deepen-since".
-
-deepen-relative
----------------
-
-If this capability is requested by the client, the semantics of
-"deepen" command is changed. The "depth" argument is the depth from
-the current shallow boundary, instead of the depth from remote refs.
-
-no-progress
------------
-
-The client was started with "git clone -q" or something, and doesn't
-want that side band 2. Basically the client just says "I do not
-wish to receive stream 2 on sideband, so do not send it to me, and if
-you did, I will drop it on the floor anyway". However, the sideband
-channel 3 is still used for error responses.
-
-include-tag
------------
-
-The 'include-tag' capability is about sending annotated tags if we are
-sending objects they point to. If we pack an object to the client, and
-a tag object points exactly at that object, we pack the tag object too.
-In general this allows a client to get all new annotated tags when it
-fetches a branch, in a single network connection.
-
-Clients MAY always send include-tag, hardcoding it into a request when
-the server advertises this capability. The decision for a client to
-request include-tag only has to do with the client's desires for tag
-data, whether or not a server had advertised objects in the
-refs/tags/* namespace.
-
-Servers MUST pack the tags if their referrant is packed and the client
-has requested include-tags.
-
-Clients MUST be prepared for the case where a server has ignored
-include-tag and has not actually sent tags in the pack. In such
-cases the client SHOULD issue a subsequent fetch to acquire the tags
-that include-tag would have otherwise given the client.
-
-The server SHOULD send include-tag, if it supports it, regardless
-of whether or not there are tags available.
-
-report-status
--------------
-
-The receive-pack process can receive a 'report-status' capability,
-which tells it that the client wants a report of what happened after
-a packfile upload and reference update. If the pushing client requests
-this capability, after unpacking and updating references the server
-will respond with whether the packfile unpacked successfully and if
-each reference was updated successfully. If any of those were not
-successful, it will send back an error message. See pack-protocol.txt
-for example messages.
-
-delete-refs
------------
-
-If the server sends back the 'delete-refs' capability, it means that
-it is capable of accepting a zero-id value as the target
-value of a reference update. It is not sent back by the client, it
-simply informs the client that it can be sent zero-id values
-to delete references.
-
-quiet
------
-
-If the receive-pack server advertises the 'quiet' capability, it is
-capable of silencing human-readable progress output which otherwise may
-be shown when processing the received pack. A send-pack client should
-respond with the 'quiet' capability to suppress server-side progress
-reporting if the local progress reporting is also being suppressed
-(e.g., via `push -q`, or if stderr does not go to a tty).
-
-atomic
-------
-
-If the server sends the 'atomic' capability it is capable of accepting
-atomic pushes. If the pushing client requests this capability, the server
-will update the refs in one atomic transaction. Either all refs are
-updated or none.
-
-push-options
-------------
-
-If the server sends the 'push-options' capability it is able to accept
-push options after the update commands have been sent, but before the
-packfile is streamed. If the pushing client requests this capability,
-the server will pass the options to the pre- and post- receive hooks
-that process this push request.
-
-allow-tip-sha1-in-want
-----------------------
-
-If the upload-pack server advertises this capability, fetch-pack may
-send "want" lines with SHA-1s that exist at the server but are not
-advertised by upload-pack.
-
-allow-reachable-sha1-in-want
-----------------------------
-
-If the upload-pack server advertises this capability, fetch-pack may
-send "want" lines with SHA-1s that exist at the server but are not
-advertised by upload-pack.
-
-push-cert=<nonce>
------------------
-
-The receive-pack server that advertises this capability is willing
-to accept a signed push certificate, and asks the <nonce> to be
-included in the push certificate. A send-pack client MUST NOT
-send a push-cert packet unless the receive-pack server advertises
-this capability.
-
-filter
-------
-
-If the upload-pack server advertises the 'filter' capability,
-fetch-pack may send "filter" commands to request a partial clone
-or partial fetch and request that the server omit various objects
-from the packfile.
diff --git a/Documentation/technical/protocol-common.txt b/Documentation/technical/protocol-common.txt
deleted file mode 100644
index ecedb34..0000000
--- a/Documentation/technical/protocol-common.txt
+++ /dev/null
@@ -1,99 +0,0 @@
-Documentation Common to Pack and Http Protocols
-===============================================
-
-ABNF Notation
--------------
-
-ABNF notation as described by RFC 5234 is used within the protocol documents,
-except the following replacement core rules are used:
-----
- HEXDIG = DIGIT / "a" / "b" / "c" / "d" / "e" / "f"
-----
-
-We also define the following common rules:
-----
- NUL = %x00
- zero-id = 40*"0"
- obj-id = 40*(HEXDIGIT)
-
- refname = "HEAD"
- refname /= "refs/" <see discussion below>
-----
-
-A refname is a hierarchical octet string beginning with "refs/" and
-not violating the 'git-check-ref-format' command's validation rules.
-More specifically, they:
-
-. They can include slash `/` for hierarchical (directory)
- grouping, but no slash-separated component can begin with a
- dot `.`.
-
-. They must contain at least one `/`. This enforces the presence of a
- category like `heads/`, `tags/` etc. but the actual names are not
- restricted.
-
-. They cannot have two consecutive dots `..` anywhere.
-
-. They cannot have ASCII control characters (i.e. bytes whose
- values are lower than \040, or \177 `DEL`), space, tilde `~`,
- caret `^`, colon `:`, question-mark `?`, asterisk `*`,
- or open bracket `[` anywhere.
-
-. They cannot end with a slash `/` or a dot `.`.
-
-. They cannot end with the sequence `.lock`.
-
-. They cannot contain a sequence `@{`.
-
-. They cannot contain a `\\`.
-
-
-pkt-line Format
----------------
-
-Much (but not all) of the payload is described around pkt-lines.
-
-A pkt-line is a variable length binary string. The first four bytes
-of the line, the pkt-len, indicates the total length of the line,
-in hexadecimal. The pkt-len includes the 4 bytes used to contain
-the length's hexadecimal representation.
-
-A pkt-line MAY contain binary data, so implementors MUST ensure
-pkt-line parsing/formatting routines are 8-bit clean.
-
-A non-binary line SHOULD BE terminated by an LF, which if present
-MUST be included in the total length. Receivers MUST treat pkt-lines
-with non-binary data the same whether or not they contain the trailing
-LF (stripping the LF if present, and not complaining when it is
-missing).
-
-The maximum length of a pkt-line's data component is 65516 bytes.
-Implementations MUST NOT send pkt-line whose length exceeds 65520
-(65516 bytes of payload + 4 bytes of length data).
-
-Implementations SHOULD NOT send an empty pkt-line ("0004").
-
-A pkt-line with a length field of 0 ("0000"), called a flush-pkt,
-is a special case and MUST be handled differently than an empty
-pkt-line ("0004").
-
-----
- pkt-line = data-pkt / flush-pkt
-
- data-pkt = pkt-len pkt-payload
- pkt-len = 4*(HEXDIG)
- pkt-payload = (pkt-len - 4)*(OCTET)
-
- flush-pkt = "0000"
-----
-
-Examples (as C-style strings):
-
-----
- pkt-line actual value
- ---------------------------------
- "0006a\n" "a\n"
- "0005a" "a"
- "000bfoobar\n" "foobar\n"
- "0004" ""
-----
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
deleted file mode 100644
index 40f91f6..0000000
--- a/Documentation/technical/protocol-v2.txt
+++ /dev/null
@@ -1,455 +0,0 @@
-Git Wire Protocol, Version 2
-============================
-
-This document presents a specification for a version 2 of Git's wire
-protocol. Protocol v2 will improve upon v1 in the following ways:
-
- * Instead of multiple service names, multiple commands will be
- supported by a single service
- * Easily extendable as capabilities are moved into their own section
- of the protocol, no longer being hidden behind a NUL byte and
- limited by the size of a pkt-line
- * Separate out other information hidden behind NUL bytes (e.g. agent
- string as a capability and symrefs can be requested using 'ls-refs')
- * Reference advertisement will be omitted unless explicitly requested
- * ls-refs command to explicitly request some refs
- * Designed with http and stateless-rpc in mind. With clear flush
- semantics the http remote helper can simply act as a proxy
-
-In protocol v2 communication is command oriented. When first contacting a
-server a list of capabilities will advertised. Some of these capabilities
-will be commands which a client can request be executed. Once a command
-has completed, a client can reuse the connection and request that other
-commands be executed.
-
-Packet-Line Framing
--------------------
-
-All communication is done using packet-line framing, just as in v1. See
-`Documentation/technical/pack-protocol.txt` and
-`Documentation/technical/protocol-common.txt` for more information.
-
-In protocol v2 these special packets will have the following semantics:
-
- * '0000' Flush Packet (flush-pkt) - indicates the end of a message
- * '0001' Delimiter Packet (delim-pkt) - separates sections of a message
-
-Initial Client Request
-----------------------
-
-In general a client can request to speak protocol v2 by sending
-`version=2` through the respective side-channel for the transport being
-used which inevitably sets `GIT_PROTOCOL`. More information can be
-found in `pack-protocol.txt` and `http-protocol.txt`. In all cases the
-response from the server is the capability advertisement.
-
-Git Transport
-~~~~~~~~~~~~~
-
-When using the git:// transport, you can request to use protocol v2 by
-sending "version=2" as an extra parameter:
-
- 003egit-upload-pack /project.git\0host=myserver.com\0\0version=2\0
-
-SSH and File Transport
-~~~~~~~~~~~~~~~~~~~~~~
-
-When using either the ssh:// or file:// transport, the GIT_PROTOCOL
-environment variable must be set explicitly to include "version=2".
-
-HTTP Transport
-~~~~~~~~~~~~~~
-
-When using the http:// or https:// transport a client makes a "smart"
-info/refs request as described in `http-protocol.txt` and requests that
-v2 be used by supplying "version=2" in the `Git-Protocol` header.
-
- C: GET $GIT_URL/info/refs?service=git-upload-pack HTTP/1.0
- C: Git-Protocol: version=2
-
-A v2 server would reply:
-
- S: 200 OK
- S: <Some headers>
- S: ...
- S:
- S: 000eversion 2\n
- S: <capability-advertisement>
-
-Subsequent requests are then made directly to the service
-`$GIT_URL/git-upload-pack`. (This works the same for git-receive-pack).
-
-Capability Advertisement
-------------------------
-
-A server which decides to communicate (based on a request from a client)
-using protocol version 2, notifies the client by sending a version string
-in its initial response followed by an advertisement of its capabilities.
-Each capability is a key with an optional value. Clients must ignore all
-unknown keys. Semantics of unknown values are left to the definition of
-each key. Some capabilities will describe commands which can be requested
-to be executed by the client.
-
- capability-advertisement = protocol-version
- capability-list
- flush-pkt
-
- protocol-version = PKT-LINE("version 2" LF)
- capability-list = *capability
- capability = PKT-LINE(key[=value] LF)
-
- key = 1*(ALPHA | DIGIT | "-_")
- value = 1*(ALPHA | DIGIT | " -_.,?\/{}[]()<>!@#$%^&*+=:;")
-
-Command Request
----------------
-
-After receiving the capability advertisement, a client can then issue a
-request to select the command it wants with any particular capabilities
-or arguments. There is then an optional section where the client can
-provide any command specific parameters or queries. Only a single
-command can be requested at a time.
-
- request = empty-request | command-request
- empty-request = flush-pkt
- command-request = command
- capability-list
- [command-args]
- flush-pkt
- command = PKT-LINE("command=" key LF)
- command-args = delim-pkt
- *command-specific-arg
-
- command-specific-args are packet line framed arguments defined by
- each individual command.
-
-The server will then check to ensure that the client's request is
-comprised of a valid command as well as valid capabilities which were
-advertised. If the request is valid the server will then execute the
-command. A server MUST wait till it has received the client's entire
-request before issuing a response. The format of the response is
-determined by the command being executed, but in all cases a flush-pkt
-indicates the end of the response.
-
-When a command has finished, and the client has received the entire
-response from the server, a client can either request that another
-command be executed or can terminate the connection. A client may
-optionally send an empty request consisting of just a flush-pkt to
-indicate that no more requests will be made.
-
-Capabilities
-------------
-
-There are two different types of capabilities: normal capabilities,
-which can be used to convey information or alter the behavior of a
-request, and commands, which are the core actions that a client wants to
-perform (fetch, push, etc).
-
-Protocol version 2 is stateless by default. This means that all commands
-must only last a single round and be stateless from the perspective of the
-server side, unless the client has requested a capability indicating that
-state should be maintained by the server. Clients MUST NOT require state
-management on the server side in order to function correctly. This
-permits simple round-robin load-balancing on the server side, without
-needing to worry about state management.
-
-agent
-~~~~~
-
-The server can advertise the `agent` capability with a value `X` (in the
-form `agent=X`) to notify the client that the server is running version
-`X`. The client may optionally send its own agent string by including
-the `agent` capability with a value `Y` (in the form `agent=Y`) in its
-request to the server (but it MUST NOT do so if the server did not
-advertise the agent capability). The `X` and `Y` strings may contain any
-printable ASCII characters except space (i.e., the byte range 32 < x <
-127), and are typically of the form "package/version" (e.g.,
-"git/1.8.3.1"). The agent strings are purely informative for statistics
-and debugging purposes, and MUST NOT be used to programmatically assume
-the presence or absence of particular features.
-
-ls-refs
-~~~~~~~
-
-`ls-refs` is the command used to request a reference advertisement in v2.
-Unlike the current reference advertisement, ls-refs takes in arguments
-which can be used to limit the refs sent from the server.
-
-Additional features not supported in the base command will be advertised
-as the value of the command in the capability advertisement in the form
-of a space separated list of features: "<command>=<feature 1> <feature 2>"
-
-ls-refs takes in the following arguments:
-
- symrefs
- In addition to the object pointed by it, show the underlying ref
- pointed by it when showing a symbolic ref.
- peel
- Show peeled tags.
- ref-prefix <prefix>
- When specified, only references having a prefix matching one of
- the provided prefixes are displayed.
-
-The output of ls-refs is as follows:
-
- output = *ref
- flush-pkt
- ref = PKT-LINE(obj-id SP refname *(SP ref-attribute) LF)
- ref-attribute = (symref | peeled)
- symref = "symref-target:" symref-target
- peeled = "peeled:" obj-id
-
-fetch
-~~~~~
-
-`fetch` is the command used to fetch a packfile in v2. It can be looked
-at as a modified version of the v1 fetch where the ref-advertisement is
-stripped out (since the `ls-refs` command fills that role) and the
-message format is tweaked to eliminate redundancies and permit easy
-addition of future extensions.
-
-Additional features not supported in the base command will be advertised
-as the value of the command in the capability advertisement in the form
-of a space separated list of features: "<command>=<feature 1> <feature 2>"
-
-A `fetch` request can take the following arguments:
-
- want <oid>
- Indicates to the server an object which the client wants to
- retrieve. Wants can be anything and are not limited to
- advertised objects.
-
- have <oid>
- Indicates to the server an object which the client has locally.
- This allows the server to make a packfile which only contains
- the objects that the client needs. Multiple 'have' lines can be
- supplied.
-
- done
- Indicates to the server that negotiation should terminate (or
- not even begin if performing a clone) and that the server should
- use the information supplied in the request to construct the
- packfile.
-
- thin-pack
- Request that a thin pack be sent, which is a pack with deltas
- which reference base objects not contained within the pack (but
- are known to exist at the receiving end). This can reduce the
- network traffic significantly, but it requires the receiving end
- to know how to "thicken" these packs by adding the missing bases
- to the pack.
-
- no-progress
- Request that progress information that would normally be sent on
- side-band channel 2, during the packfile transfer, should not be
- sent. However, the side-band channel 3 is still used for error
- responses.
-
- include-tag
- Request that annotated tags should be sent if the objects they
- point to are being sent.
-
- ofs-delta
- Indicate that the client understands PACKv2 with delta referring
- to its base by position in pack rather than by an oid. That is,
- they can read OBJ_OFS_DELTA (ake type 6) in a packfile.
-
-If the 'shallow' feature is advertised the following arguments can be
-included in the clients request as well as the potential addition of the
-'shallow-info' section in the server's response as explained below.
-
- shallow <oid>
- A client must notify the server of all commits for which it only
- has shallow copies (meaning that it doesn't have the parents of
- a commit) by supplying a 'shallow <oid>' line for each such
- object so that the server is aware of the limitations of the
- client's history. This is so that the server is aware that the
- client may not have all objects reachable from such commits.
-
- deepen <depth>
- Requests that the fetch/clone should be shallow having a commit
- depth of <depth> relative to the remote side.
-
- deepen-relative
- Requests that the semantics of the "deepen" command be changed
- to indicate that the depth requested is relative to the client's
- current shallow boundary, instead of relative to the requested
- commits.
-
- deepen-since <timestamp>
- Requests that the shallow clone/fetch should be cut at a
- specific time, instead of depth. Internally it's equivalent to
- doing "git rev-list --max-age=<timestamp>". Cannot be used with
- "deepen".
-
- deepen-not <rev>
- Requests that the shallow clone/fetch should be cut at a
- specific revision specified by '<rev>', instead of a depth.
- Internally it's equivalent of doing "git rev-list --not <rev>".
- Cannot be used with "deepen", but can be used with
- "deepen-since".
-
-If the 'filter' feature is advertised, the following argument can be
-included in the client's request:
-
- filter <filter-spec>
- Request that various objects from the packfile be omitted
- using one of several filtering techniques. These are intended
- for use with partial clone and partial fetch operations. See
- `rev-list` for possible "filter-spec" values. When communicating
- with other processes, senders SHOULD translate scaled integers
- (e.g. "1k") into a fully-expanded form (e.g. "1024") to aid
- interoperability with older receivers that may not understand
- newly-invented scaling suffixes. However, receivers SHOULD
- accept the following suffixes: 'k', 'm', and 'g' for 1024,
- 1048576, and 1073741824, respectively.
-
-If the 'ref-in-want' feature is advertised, the following argument can
-be included in the client's request as well as the potential addition of
-the 'wanted-refs' section in the server's response as explained below.
-
- want-ref <ref>
- Indicates to the server that the client wants to retrieve a
- particular ref, where <ref> is the full name of a ref on the
- server.
-
-If the 'sideband-all' feature is advertised, the following argument can be
-included in the client's request:
-
- sideband-all
- Instruct the server to send the whole response multiplexed, not just
- the packfile section. All non-flush and non-delim PKT-LINE in the
- response (not only in the packfile section) will then start with a byte
- indicating its sideband (1, 2, or 3), and the server may send "0005\2"
- (a PKT-LINE of sideband 2 with no payload) as a keepalive packet.
-
-The response of `fetch` is broken into a number of sections separated by
-delimiter packets (0001), with each section beginning with its section
-header.
-
- output = *section
- section = (acknowledgments | shallow-info | wanted-refs | packfile)
- (flush-pkt | delim-pkt)
-
- acknowledgments = PKT-LINE("acknowledgments" LF)
- (nak | *ack)
- (ready)
- ready = PKT-LINE("ready" LF)
- nak = PKT-LINE("NAK" LF)
- ack = PKT-LINE("ACK" SP obj-id LF)
-
- shallow-info = PKT-LINE("shallow-info" LF)
- *PKT-LINE((shallow | unshallow) LF)
- shallow = "shallow" SP obj-id
- unshallow = "unshallow" SP obj-id
-
- wanted-refs = PKT-LINE("wanted-refs" LF)
- *PKT-LINE(wanted-ref LF)
- wanted-ref = obj-id SP refname
-
- packfile = PKT-LINE("packfile" LF)
- *PKT-LINE(%x01-03 *%x00-ff)
-
- acknowledgments section
- * If the client determines that it is finished with negotiations
- by sending a "done" line, the acknowledgments sections MUST be
- omitted from the server's response.
-
- * Always begins with the section header "acknowledgments"
-
- * The server will respond with "NAK" if none of the object ids sent
- as have lines were common.
-
- * The server will respond with "ACK obj-id" for all of the
- object ids sent as have lines which are common.
-
- * A response cannot have both "ACK" lines as well as a "NAK"
- line.
-
- * The server will respond with a "ready" line indicating that
- the server has found an acceptable common base and is ready to
- make and send a packfile (which will be found in the packfile
- section of the same response)
-
- * If the server has found a suitable cut point and has decided
- to send a "ready" line, then the server can decide to (as an
- optimization) omit any "ACK" lines it would have sent during
- its response. This is because the server will have already
- determined the objects it plans to send to the client and no
- further negotiation is needed.
-
- shallow-info section
- * If the client has requested a shallow fetch/clone, a shallow
- client requests a fetch or the server is shallow then the
- server's response may include a shallow-info section. The
- shallow-info section will be included if (due to one of the
- above conditions) the server needs to inform the client of any
- shallow boundaries or adjustments to the clients already
- existing shallow boundaries.
-
- * Always begins with the section header "shallow-info"
-
- * If a positive depth is requested, the server will compute the
- set of commits which are no deeper than the desired depth.
-
- * The server sends a "shallow obj-id" line for each commit whose
- parents will not be sent in the following packfile.
-
- * The server sends an "unshallow obj-id" line for each commit
- which the client has indicated is shallow, but is no longer
- shallow as a result of the fetch (due to its parents being
- sent in the following packfile).
-
- * The server MUST NOT send any "unshallow" lines for anything
- which the client has not indicated was shallow as a part of
- its request.
-
- * This section is only included if a packfile section is also
- included in the response.
-
- wanted-refs section
- * This section is only included if the client has requested a
- ref using a 'want-ref' line and if a packfile section is also
- included in the response.
-
- * Always begins with the section header "wanted-refs".
-
- * The server will send a ref listing ("<oid> <refname>") for
- each reference requested using 'want-ref' lines.
-
- * The server MUST NOT send any refs which were not requested
- using 'want-ref' lines.
-
- packfile section
- * This section is only included if the client has sent 'want'
- lines in its request and either requested that no more
- negotiation be done by sending 'done' or if the server has
- decided it has found a sufficient cut point to produce a
- packfile.
-
- * Always begins with the section header "packfile"
-
- * The transmission of the packfile begins immediately after the
- section header
-
- * The data transfer of the packfile is always multiplexed, using
- the same semantics of the 'side-band-64k' capability from
- protocol version 1. This means that each packet, during the
- packfile data stream, is made up of a leading 4-byte pkt-line
- length (typical of the pkt-line format), followed by a 1-byte
- stream code, followed by the actual data.
-
- The stream code can be one of:
- 1 - pack data
- 2 - progress messages
- 3 - fatal error message just before stream aborts
-
-server-option
-~~~~~~~~~~~~~
-
-If advertised, indicates that any number of server specific options can be
-included in a request. This is done by sending each option as a
-"server-option=<option>" capability line in the capability-list section of
-a request.
-
-The provided options must not contain a NUL or LF character.
diff --git a/Documentation/technical/racy-git.txt b/Documentation/technical/racy-git.txt
index 4a8be4d..59bea66 100644
--- a/Documentation/technical/racy-git.txt
+++ b/Documentation/technical/racy-git.txt
@@ -11,7 +11,7 @@ write out the next tree object to be committed. The state is
"virtual" in the sense that it does not necessarily have to, and
often does not, match the files in the working tree.
-There are cases Git needs to examine the differences between the
+There are cases where Git needs to examine the differences between the
virtual working tree state in the index and the files in the
working tree. The most obvious case is when the user asks `git
diff` (or its low level implementation, `git diff-files`) or
@@ -51,7 +51,7 @@ of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
only fixes the issue for file systems with exactly 1 ns or 1 s
resolution. Other file systems are still broken in current Linux
kernels (e.g. CEPH, CIFS, NTFS, UDF), see
-https://lkml.org/lkml/2015/6/9/714
+https://lore.kernel.org/lkml/5577240D.7020309@gmail.com/
Racy Git
--------
@@ -165,9 +165,9 @@ Avoiding runtime penalty
In order to avoid the above runtime penalty, post 1.4.2 Git used
to have a code that made sure the index file
-got timestamp newer than the youngest files in the index when
-there are many young files with the same timestamp as the
-resulting index file would otherwise would have by waiting
+got a timestamp newer than the youngest files in the index when
+there were many young files with the same timestamp as the
+resulting index file otherwise would have by waiting
before finishing writing the index file out.
I suspected that in practice the situation where many paths in the
@@ -190,7 +190,7 @@ In a large project where raciness avoidance cost really matters,
however, the initial computation of all object names in the
index takes more than one second, and the index file is written
out after all that happens. Therefore the timestamp of the
-index file will be more than one seconds later than the
+index file will be more than one second later than the
youngest file in the working tree. This means that in these
cases there actually will not be any racily clean entry in
the resulting index.
diff --git a/Documentation/technical/reftable.txt b/Documentation/technical/reftable.txt
new file mode 100644
index 0000000..dd0b37c
--- /dev/null
+++ b/Documentation/technical/reftable.txt
@@ -0,0 +1,1098 @@
+reftable
+--------
+
+Overview
+~~~~~~~~
+
+Problem statement
+^^^^^^^^^^^^^^^^^
+
+Some repositories contain a lot of references (e.g. android at 866k,
+rails at 31k). The existing packed-refs format takes up a lot of space
+(e.g. 62M), and does not scale with additional references. Lookup of a
+single reference requires linearly scanning the file.
+
+Atomic pushes modifying multiple references require copying the entire
+packed-refs file, which can be a considerable amount of data moved
+(e.g. 62M in, 62M out) for even small transactions (2 refs modified).
+
+Repositories with many loose references occupy a large number of disk
+blocks from the local file system, as each reference is its own file
+storing 41 bytes (and another file for the corresponding reflog). This
+negatively affects the number of inodes available when a large number of
+repositories are stored on the same filesystem. Readers can be penalized
+due to the larger number of syscalls required to traverse and read the
+`$GIT_DIR/refs` directory.
+
+
+Objectives
+^^^^^^^^^^
+
+* Near constant time lookup for any single reference, even when the
+repository is cold and not in process or kernel cache.
+* Near constant time verification if an object name is referred to by at least
+one reference (for allow-tip-sha1-in-want).
+* Efficient enumeration of an entire namespace, such as `refs/tags/`.
+* Support atomic push with `O(size_of_update)` operations.
+* Combine reflog storage with ref storage for small transactions.
+* Separate reflog storage for base refs and historical logs.
+
+Description
+^^^^^^^^^^^
+
+A reftable file is a portable binary file format customized for
+reference storage. References are sorted, enabling linear scans, binary
+search lookup, and range scans.
+
+Storage in the file is organized into variable sized blocks. Prefix
+compression is used within a single block to reduce disk space. Block
+size and alignment are tunable by the writer.
+
+Performance
+^^^^^^^^^^^
+
+Space used, packed-refs vs. reftable:
+
+[cols=",>,>,>,>,>",options="header",]
+|===============================================================
+|repository |packed-refs |reftable |% original |avg ref |avg obj
+|android |62.2 M |36.1 M |58.0% |33 bytes |5 bytes
+|rails |1.8 M |1.1 M |57.7% |29 bytes |4 bytes
+|git |78.7 K |48.1 K |61.0% |50 bytes |4 bytes
+|git (heads) |332 b |269 b |81.0% |33 bytes |0 bytes
+|===============================================================
+
+Scan (read 866k refs), by reference name lookup (single ref from 866k
+refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs):
+
+[cols=",>,>,>,>",options="header",]
+|=========================================================
+|format |cache |scan |by name |by SHA-1
+|packed-refs |cold |402 ms |409,660.1 usec |412,535.8 usec
+|packed-refs |hot | |6,844.6 usec |20,110.1 usec
+|reftable |cold |112 ms |33.9 usec |323.2 usec
+|reftable |hot | |20.2 usec |320.8 usec
+|=========================================================
+
+Space used for 149,932 log entries for 43,061 refs, reflog vs. reftable:
+
+[cols=",>,>",options="header",]
+|================================
+|format |size |avg entry
+|$GIT_DIR/logs |173 M |1209 bytes
+|reftable |5 M |37 bytes
+|================================
+
+Details
+~~~~~~~
+
+Peeling
+^^^^^^^
+
+References stored in a reftable are peeled, a record for an annotated
+(or signed) tag records both the tag object, and the object it refers
+to. This is analogous to storage in the packed-refs format.
+
+Reference name encoding
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Reference names are an uninterpreted sequence of bytes that must pass
+linkgit:git-check-ref-format[1] as a valid reference name.
+
+Key unicity
+^^^^^^^^^^^
+
+Each entry must have a unique key; repeated keys are disallowed.
+
+Network byte order
+^^^^^^^^^^^^^^^^^^
+
+All multi-byte, fixed width fields are in network byte order.
+
+Varint encoding
+^^^^^^^^^^^^^^^
+
+Varint encoding is identical to the ofs-delta encoding method used
+within pack files.
+
+Decoder works as follows:
+
+....
+val = buf[ptr] & 0x7f
+while (buf[ptr] & 0x80) {
+ ptr++
+ val = ((val + 1) << 7) | (buf[ptr] & 0x7f)
+}
+....
+
+Ordering
+^^^^^^^^
+
+Blocks are lexicographically ordered by their first reference.
+
+Directory/file conflicts
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The reftable format accepts both `refs/heads/foo` and
+`refs/heads/foo/bar` as distinct references.
+
+This property is useful for retaining log records in reftable, but may
+confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain
+references. Users of reftable may choose to continue to reject `foo` and
+`foo/bar` type conflicts to prevent problems for peers.
+
+File format
+~~~~~~~~~~~
+
+Structure
+^^^^^^^^^
+
+A reftable file has the following high-level structure:
+
+....
+first_block {
+ header
+ first_ref_block
+}
+ref_block*
+ref_index*
+obj_block*
+obj_index*
+log_block*
+log_index*
+footer
+....
+
+A log-only file omits the `ref_block`, `ref_index`, `obj_block` and
+`obj_index` sections, containing only the file header and log block:
+
+....
+first_block {
+ header
+}
+log_block*
+log_index*
+footer
+....
+
+In a log-only file, the first log block immediately follows the file
+header, without padding to block alignment.
+
+Block size
+^^^^^^^^^^
+
+The file's block size is arbitrarily determined by the writer, and does
+not have to be a power of 2. The block size must be larger than the
+longest reference name or log entry used in the repository, as
+references cannot span blocks.
+
+Powers of two that are friendly to the virtual memory system or
+filesystem (such as 4k or 8k) are recommended. Larger sizes (64k) can
+yield better compression, with a possible increased cost incurred by
+readers during access.
+
+The largest block size is `16777215` bytes (15.99 MiB).
+
+Block alignment
+^^^^^^^^^^^^^^^
+
+Writers may choose to align blocks at multiples of the block size by
+including `padding` filled with NUL bytes at the end of a block to round
+out to the chosen alignment. When alignment is used, writers must
+specify the alignment with the file header's `block_size` field.
+
+Block alignment is not required by the file format. Unaligned files must
+set `block_size = 0` in the file header, and omit `padding`. Unaligned
+files with more than one ref block must include the link:#Ref-index[ref
+index] to support fast lookup. Readers must be able to read both aligned
+and non-aligned files.
+
+Very small files (e.g. a single ref block) may omit `padding` and the ref
+index to reduce total file size.
+
+Header (version 1)
+^^^^^^^^^^^^^^^^^^
+
+A 24-byte header appears at the beginning of the file:
+
+....
+'REFT'
+uint8( version_number = 1 )
+uint24( block_size )
+uint64( min_update_index )
+uint64( max_update_index )
+....
+
+Aligned files must specify `block_size` to configure readers with the
+expected block alignment. Unaligned files must set `block_size = 0`.
+
+The `min_update_index` and `max_update_index` describe bounds for the
+`update_index` field of all log records in this file. When reftables are
+used in a stack for link:#Update-transactions[transactions], these
+fields can order the files such that the prior file's
+`max_update_index + 1` is the next file's `min_update_index`.
+
+Header (version 2)
+^^^^^^^^^^^^^^^^^^
+
+A 28-byte header appears at the beginning of the file:
+
+....
+'REFT'
+uint8( version_number = 2 )
+uint24( block_size )
+uint64( min_update_index )
+uint64( max_update_index )
+uint32( hash_id )
+....
+
+The header is identical to `version_number=1`, with the 4-byte hash ID
+("sha1" for SHA1 and "s256" for SHA-256) appended to the header.
+
+For maximum backward compatibility, it is recommended to use version 1 when
+writing SHA1 reftables.
+
+First ref block
+^^^^^^^^^^^^^^^
+
+The first ref block shares the same block as the file header, and is 24
+bytes smaller than all other blocks in the file. The first block
+immediately begins after the file header, at position 24.
+
+If the first block is a log block (a log-only file), its block header
+begins immediately at position 24.
+
+Ref block format
+^^^^^^^^^^^^^^^^
+
+A ref block is written as:
+
+....
+'r'
+uint24( block_len )
+ref_record+
+uint24( restart_offset )+
+uint16( restart_count )
+
+padding?
+....
+
+Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which
+encodes the number of bytes in the block up to, but not including the
+optional `padding`. This is always less than or equal to the file's
+block size. In the first ref block, `block_len` includes 24 bytes for
+the file header.
+
+The 2-byte `restart_count` stores the number of entries in the
+`restart_offset` list, which must not be empty. Readers can use
+`restart_count` to binary search between restarts before starting a
+linear scan.
+
+Exactly `restart_count` 3-byte `restart_offset` values precede the
+`restart_count`. Offsets are relative to the start of the block and
+refer to the first byte of any `ref_record` whose name has not been
+prefix compressed. Entries in the `restart_offset` list must be sorted,
+ascending. Readers can start linear scans from any of these records.
+
+A variable number of `ref_record` fill the middle of the block,
+describing reference names and values. The format is described below.
+
+As the first ref block shares the first file block with the file header,
+all `restart_offset` in the first block are relative to the start of the
+file (position 0), and include the file header. This forces the first
+`restart_offset` to be `28`.
+
+ref record
+++++++++++
+
+A `ref_record` describes a single reference, storing both the name and
+its value(s). Records are formatted as:
+
+....
+varint( prefix_length )
+varint( (suffix_length << 3) | value_type )
+suffix
+varint( update_index_delta )
+value?
+....
+
+The `prefix_length` field specifies how many leading bytes of the prior
+reference record's name should be copied to obtain this reference's
+name. This must be 0 for the first reference in any block, and also must
+be 0 for any `ref_record` whose offset is listed in the `restart_offset`
+table at the end of the block.
+
+Recovering a reference name from any `ref_record` is a simple concat:
+
+....
+this_name = prior_name[0..prefix_length] + suffix
+....
+
+The `suffix_length` value provides the number of bytes available in
+`suffix` to copy from `suffix` to complete the reference name.
+
+The `update_index` that last modified the reference can be obtained by
+adding `update_index_delta` to the `min_update_index` from the file
+header: `min_update_index + update_index_delta`.
+
+The `value` follows. Its format is determined by `value_type`, one of
+the following:
+
+* `0x0`: deletion; no value data (see transactions, below)
+* `0x1`: one object name; value of the ref
+* `0x2`: two object names; value of the ref, peeled target
+* `0x3`: symbolic reference: `varint( target_len ) target`
+
+Symbolic references use `0x3`, followed by the complete name of the
+reference target. No compression is applied to the target name.
+
+Types `0x4..0x7` are reserved for future use.
+
+Ref index
+^^^^^^^^^
+
+The ref index stores the name of the last reference from every ref block
+in the file, enabling reduced disk seeks for lookups. Any reference can
+be found by searching the index, identifying the containing block, and
+searching within that block.
+
+The index may be organized into a multi-level index, where the 1st level
+index block points to additional ref index blocks (2nd level), which may
+in turn point to either additional index blocks (e.g. 3rd level) or ref
+blocks (leaf level). Disk reads required to access a ref go up with
+higher index levels. Multi-level indexes may be required to ensure no
+single index block exceeds the file format's max block size of
+`16777215` bytes (15.99 MiB). To achieve constant O(1) disk seeks for
+lookups the index must be a single level, which is permitted to exceed
+the file's configured block size, but not the format's max block size of
+15.99 MiB.
+
+If present, the ref index block(s) appears after the last ref block.
+
+If there are at least 4 ref blocks, a ref index block should be written
+to improve lookup times. Cold reads using the index require 2 disk reads
+(read index, read block), and binary searching < 4 blocks also requires
+<= 2 reads. Omitting the index block from smaller files saves space.
+
+If the file is unaligned and contains more than one ref block, the ref
+index must be written.
+
+Index block format:
+
+....
+'i'
+uint24( block_len )
+index_record+
+uint24( restart_offset )+
+uint16( restart_count )
+
+padding?
+....
+
+The index blocks begin with `block_type = 'i'` and a 3-byte `block_len`
+which encodes the number of bytes in the block, up to but not including
+the optional `padding`.
+
+The `restart_offset` and `restart_count` fields are identical in format,
+meaning and usage as in ref blocks.
+
+To reduce the number of reads required for random access in very large
+files the index block may be larger than other blocks. However, readers
+must hold the entire index in memory to benefit from this, so it's a
+time-space tradeoff in both file size and reader memory.
+
+Increasing the file's block size decreases the index size. Alternatively
+a multi-level index may be used, keeping index blocks within the file's
+block size, but increasing the number of blocks that need to be
+accessed.
+
+index record
+++++++++++++
+
+An index record describes the last entry in another block. Index records
+are written as:
+
+....
+varint( prefix_length )
+varint( (suffix_length << 3) | 0 )
+suffix
+varint( block_position )
+....
+
+Index records use prefix compression exactly like `ref_record`.
+
+Index records store `block_position` after the suffix, specifying the
+absolute position in bytes (from the start of the file) of the block
+that ends with this reference. Readers can seek to `block_position` to
+begin reading the block header.
+
+Readers must examine the block header at `block_position` to determine
+if the next block is another level index block, or the leaf-level ref
+block.
+
+Reading the index
++++++++++++++++++
+
+Readers loading the ref index must first read the footer (below) to
+obtain `ref_index_position`. If not present, the position will be 0. The
+`ref_index_position` is for the 1st level root of the ref index.
+
+Obj block format
+^^^^^^^^^^^^^^^^
+
+Object blocks are optional. Writers may choose to omit object blocks,
+especially if readers will not use the object name to ref mapping.
+
+Object blocks use unique, abbreviated 2-31 byte object name keys, mapping to
+ref blocks containing references pointing to that object directly, or as
+the peeled value of an annotated tag. Like ref blocks, object blocks use
+the file's standard block size. The abbreviation length is available in
+the footer as `obj_id_len`.
+
+To save space in small files, object blocks may be omitted if the ref
+index is not present, as brute force search will only need to read a few
+ref blocks. When missing, readers should brute force a linear search of
+all references to lookup by object name.
+
+An object block is written as:
+
+....
+'o'
+uint24( block_len )
+obj_record+
+uint24( restart_offset )+
+uint16( restart_count )
+
+padding?
+....
+
+Fields are identical to ref block. Binary search using the restart table
+works the same as in reference blocks.
+
+Because object names are abbreviated by writers to the shortest unique
+abbreviation within the reftable, obj key lengths have a variable length. Their
+length must be at least 2 bytes. Readers must compare only for common prefix
+match within an obj block or obj index.
+
+obj record
+++++++++++
+
+An `obj_record` describes a single object abbreviation, and the blocks
+containing references using that unique abbreviation:
+
+....
+varint( prefix_length )
+varint( (suffix_length << 3) | cnt_3 )
+suffix
+varint( cnt_large )?
+varint( position_delta )*
+....
+
+Like in reference blocks, abbreviations are prefix compressed within an
+obj block. On large reftables with many unique objects, higher block
+sizes (64k), and higher restart interval (128), a `prefix_length` of 2
+or 3 and `suffix_length` of 3 may be common in obj records (unique
+abbreviation of 5-6 raw bytes, 10-12 hex digits).
+
+Each record contains `position_count` number of positions for matching
+ref blocks. For 1-7 positions the count is stored in `cnt_3`. When
+`cnt_3 = 0` the actual count follows in a varint, `cnt_large`.
+
+The use of `cnt_3` bets most objects are pointed to by only a single
+reference, some may be pointed to by a couple of references, and very
+few (if any) are pointed to by more than 7 references.
+
+A special case exists when `cnt_3 = 0` and `cnt_large = 0`: there are no
+`position_delta`, but at least one reference starts with this
+abbreviation. A reader that needs exact reference names must scan all
+references to find which specific references have the desired object.
+Writers should use this format when the `position_delta` list would have
+overflowed the file's block size due to a high number of references
+pointing to the same object.
+
+The first `position_delta` is the position from the start of the file.
+Additional `position_delta` entries are sorted ascending and relative to
+the prior entry, e.g. a reader would perform:
+
+....
+pos = position_delta[0]
+prior = pos
+for (j = 1; j < position_count; j++) {
+ pos = prior + position_delta[j]
+ prior = pos
+}
+....
+
+With a position in hand, a reader must linearly scan the ref block,
+starting from the first `ref_record`, testing each reference's object names
+(for `value_type = 0x1` or `0x2`) for full equality. Faster searching by
+object name within a single ref block is not supported by the reftable format.
+Smaller block sizes reduce the number of candidates this step must
+consider.
+
+Obj index
+^^^^^^^^^
+
+The obj index stores the abbreviation from the last entry for every obj
+block in the file, enabling reduced disk seeks for all lookups. It is
+formatted exactly the same as the ref index, but refers to obj blocks.
+
+The obj index should be present if obj blocks are present, as obj blocks
+should only be written in larger files.
+
+Readers loading the obj index must first read the footer (below) to
+obtain `obj_index_position`. If not present, the position will be 0.
+
+Log block format
+^^^^^^^^^^^^^^^^
+
+Unlike ref and obj blocks, log blocks are always unaligned.
+
+Log blocks are variable in size, and do not match the `block_size`
+specified in the file header or footer. Writers should choose an
+appropriate buffer size to prepare a log block for deflation, such as
+`2 * block_size`.
+
+A log block is written as:
+
+....
+'g'
+uint24( block_len )
+zlib_deflate {
+ log_record+
+ uint24( restart_offset )+
+ uint16( restart_count )
+}
+....
+
+Log blocks look similar to ref blocks, except `block_type = 'g'`.
+
+The 4-byte block header is followed by the deflated block contents using
+zlib deflate. The `block_len` in the header is the inflated size
+(including 4-byte block header), and should be used by readers to
+preallocate the inflation output buffer. A log block's `block_len` may
+exceed the file's block size.
+
+Offsets within the log block (e.g. `restart_offset`) still include the
+4-byte header. Readers may prefer prefixing the inflation output buffer
+with the 4-byte header.
+
+Within the deflate container, a variable number of `log_record` describe
+reference changes. The log record format is described below. See ref
+block format (above) for a description of `restart_offset` and
+`restart_count`.
+
+Because log blocks have no alignment or padding between blocks, readers
+must keep track of the bytes consumed by the inflater to know where the
+next log block begins.
+
+log record
+++++++++++
+
+Log record keys are structured as:
+
+....
+ref_name '\0' reverse_int64( update_index )
+....
+
+where `update_index` is the unique transaction identifier. The
+`update_index` field must be unique within the scope of a `ref_name`.
+See the update transactions section below for further details.
+
+The `reverse_int64` function inverses the value so lexicographical
+ordering the network byte order encoding sorts the more recent records
+with higher `update_index` values first:
+
+....
+reverse_int64(int64 t) {
+ return 0xffffffffffffffff - t;
+}
+....
+
+Log records have a similar starting structure to ref and index records,
+utilizing the same prefix compression scheme applied to the log record
+key described above.
+
+....
+ varint( prefix_length )
+ varint( (suffix_length << 3) | log_type )
+ suffix
+ log_data {
+ old_id
+ new_id
+ varint( name_length ) name
+ varint( email_length ) email
+ varint( time_seconds )
+ sint16( tz_offset )
+ varint( message_length ) message
+ }?
+....
+
+Log record entries use `log_type` to indicate what follows:
+
+* `0x0`: deletion; no log data.
+* `0x1`: standard git reflog data using `log_data` above.
+
+The `log_type = 0x0` is mostly useful for `git stash drop`, removing an
+entry from the reflog of `refs/stash` in a transaction file (below),
+without needing to rewrite larger files. Readers reading a stack of
+reflogs must treat this as a deletion.
+
+For `log_type = 0x1`, the `log_data` section follows
+linkgit:git-update-ref[1] logging and includes:
+
+* two object names (old id, new id)
+* varint string of committer's name
+* varint string of committer's email
+* varint time in seconds since epoch (Jan 1, 1970)
+* 2-byte timezone offset in minutes (signed)
+* varint string of message
+
+`tz_offset` is the absolute number of minutes from GMT the committer was
+at the time of the update. For example `GMT-0800` is encoded in reftable
+as `sint16(-480)` and `GMT+0230` is `sint16(150)`.
+
+The committer email does not contain `<` or `>`, it's the value normally
+found between the `<>` in a git commit object header.
+
+The `message_length` may be 0, in which case there was no message
+supplied for the update.
+
+Contrary to traditional reflog (which is a file), renames are encoded as
+a combination of ref deletion and ref creation. A deletion is a log
+record with a zero new_id, and a creation is a log record with a zero old_id.
+
+Reading the log
++++++++++++++++
+
+Readers accessing the log must first read the footer (below) to
+determine the `log_position`. The first block of the log begins at
+`log_position` bytes since the start of the file. The `log_position` is
+not block aligned.
+
+Importing logs
+++++++++++++++
+
+When importing from `$GIT_DIR/logs` writers should globally order all
+log records roughly by timestamp while preserving file order, and assign
+unique, increasing `update_index` values for each log line. Newer log
+records get higher `update_index` values.
+
+Although an import may write only a single reftable file, the reftable
+file must span many unique `update_index`, as each log line requires its
+own `update_index` to preserve semantics.
+
+Log index
+^^^^^^^^^
+
+The log index stores the log key
+(`refname \0 reverse_int64(update_index)`) for the last log record of
+every log block in the file, supporting bounded-time lookup.
+
+A log index block must be written if 2 or more log blocks are written to
+the file. If present, the log index appears after the last log block.
+There is no padding used to align the log index to block alignment.
+
+Log index format is identical to ref index, except the keys are 9 bytes
+longer to include `'\0'` and the 8-byte `reverse_int64(update_index)`.
+Records use `block_position` to refer to the start of a log block.
+
+Reading the index
++++++++++++++++++
+
+Readers loading the log index must first read the footer (below) to
+obtain `log_index_position`. If not present, the position will be 0.
+
+Footer
+^^^^^^
+
+After the last block of the file, a file footer is written. It begins
+like the file header, but is extended with additional data.
+
+....
+ HEADER
+
+ uint64( ref_index_position )
+ uint64( (obj_position << 5) | obj_id_len )
+ uint64( obj_index_position )
+
+ uint64( log_position )
+ uint64( log_index_position )
+
+ uint32( CRC-32 of above )
+....
+
+If a section is missing (e.g. ref index) the corresponding position
+field (e.g. `ref_index_position`) will be 0.
+
+* `obj_position`: byte position for the first obj block.
+* `obj_id_len`: number of bytes used to abbreviate object names in
+obj blocks.
+* `log_position`: byte position for the first log block.
+* `ref_index_position`: byte position for the start of the ref index.
+* `obj_index_position`: byte position for the start of the obj index.
+* `log_index_position`: byte position for the start of the log index.
+
+The size of the footer is 68 bytes for version 1, and 72 bytes for
+version 2.
+
+Reading the footer
+++++++++++++++++++
+
+Readers must first read the file start to determine the version
+number. Then they seek to `file_length - FOOTER_LENGTH` to access the
+footer. A trusted external source (such as `stat(2)`) is necessary to
+obtain `file_length`. When reading the footer, readers must verify:
+
+* 4-byte magic is correct
+* 1-byte version number is recognized
+* 4-byte CRC-32 matches the other 64 bytes (including magic, and
+version)
+
+Once verified, the other fields of the footer can be accessed.
+
+Empty tables
+++++++++++++
+
+A reftable may be empty. In this case, the file starts with a header
+and is immediately followed by a footer.
+
+Binary search
+^^^^^^^^^^^^^
+
+Binary search within a block is supported by the `restart_offset` fields
+at the end of the block. Readers can binary search through the restart
+table to locate between which two restart points the sought reference or
+key should appear.
+
+Each record identified by a `restart_offset` stores the complete key in
+the `suffix` field of the record, making the compare operation during
+binary search straightforward.
+
+Once a restart point lexicographically before the sought reference has
+been identified, readers can linearly scan through the following record
+entries to locate the sought record, terminating if the current record
+sorts after (and therefore the sought key is not present).
+
+Restart point selection
++++++++++++++++++++++++
+
+Writers determine the restart points at file creation. The process is
+arbitrary, but every 16 or 64 records is recommended. Every 16 may be
+more suitable for smaller block sizes (4k or 8k), every 64 for larger
+block sizes (64k).
+
+More frequent restart points reduces prefix compression and increases
+space consumed by the restart table, both of which increase file size.
+
+Less frequent restart points makes prefix compression more effective,
+decreasing overall file size, with increased penalties for readers
+walking through more records after the binary search step.
+
+A maximum of `65535` restart points per block is supported.
+
+Considerations
+~~~~~~~~~~~~~~
+
+Lightweight refs dominate
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The reftable format assumes the vast majority of references are single
+object names valued with common prefixes, such as Gerrit Code Review's
+`refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many
+lightweight tags in the `refs/tags/` namespace.
+
+Annotated tags storing the peeled object cost an additional object name per
+reference.
+
+Low overhead
+^^^^^^^^^^^^
+
+A reftable with very few references (e.g. git.git with 5 heads) is 269
+bytes for reftable, vs. 332 bytes for packed-refs. This supports
+reftable scaling down for transaction logs (below).
+
+Block size
+^^^^^^^^^^
+
+For a Gerrit Code Review type repository with many change refs, larger
+block sizes (64 KiB) and less frequent restart points (every 64) yield
+better compression due to more references within the block compressing
+against the prior reference.
+
+Larger block sizes reduce the index size, as the reftable will require
+fewer blocks to store the same number of references.
+
+Minimal disk seeks
+^^^^^^^^^^^^^^^^^^
+
+Assuming the index block has been loaded into memory, binary searching
+for any single reference requires exactly 1 disk seek to load the
+containing block.
+
+Scans and lookups dominate
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Scanning all references and lookup by name (or namespace such as
+`refs/heads/`) are the most common activities performed on repositories.
+Object names are stored directly with references to optimize this use case.
+
+Logs are infrequently read
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Logs are infrequently accessed, but can be large. Deflating log blocks
+saves disk space, with some increased penalty at read time.
+
+Logs are stored in an isolated section from refs, reducing the burden on
+reference readers that want to ignore logs. Further, historical logs can
+be isolated into log-only files.
+
+Logs are read backwards
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Logs are frequently accessed backwards (most recent N records for master
+to answer `master@{4}`), so log records are grouped by reference, and
+sorted descending by update index.
+
+Repository format
+~~~~~~~~~~~~~~~~~
+
+Version 1
+^^^^^^^^^
+
+A repository must set its `$GIT_DIR/config` to configure reftable:
+
+....
+[core]
+ repositoryformatversion = 1
+[extensions]
+ refStorage = reftable
+....
+
+Layout
+^^^^^^
+
+A collection of reftable files are stored in the `$GIT_DIR/reftable/` directory.
+Their names should have a random element, such that each filename is globally
+unique; this helps avoid spurious failures on Windows, where open files cannot
+be removed or overwritten. It suggested to use
+`${min_update_index}-${max_update_index}-${random}.ref` as a naming convention.
+
+Log-only files use the `.log` extension, while ref-only and mixed ref
+and log files use `.ref`. extension.
+
+The stack ordering file is `$GIT_DIR/reftable/tables.list` and lists the
+current files, one per line, in order, from oldest (base) to newest
+(most recent):
+
+....
+$ cat .git/reftable/tables.list
+00000001-00000001-RANDOM1.log
+00000002-00000002-RANDOM2.ref
+00000003-00000003-RANDOM3.ref
+....
+
+Readers must read `$GIT_DIR/reftable/tables.list` to determine which
+files are relevant right now, and search through the stack in reverse
+order (last reftable is examined first).
+
+Reftable files not listed in `tables.list` may be new (and about to be
+added to the stack by the active writer), or ancient and ready to be
+pruned.
+
+Backward compatibility
+^^^^^^^^^^^^^^^^^^^^^^
+
+Older clients should continue to recognize the directory as a git
+repository so they don't look for an enclosing repository in parent
+directories. To this end, a reftable-enabled repository must contain the
+following dummy files
+
+* `.git/HEAD`, a regular file containing `ref: refs/heads/.invalid`.
+* `.git/refs/`, a directory
+* `.git/refs/heads`, a regular file
+
+Readers
+^^^^^^^
+
+Readers can obtain a consistent snapshot of the reference space by
+following:
+
+1. Open and read the `tables.list` file.
+2. Open each of the reftable files that it mentions.
+3. If any of the files is missing, goto 1.
+4. Read from the now-open files as long as necessary.
+
+Update transactions
+^^^^^^^^^^^^^^^^^^^
+
+Although reftables are immutable, mutations are supported by writing a
+new reftable and atomically appending it to the stack:
+
+1. Acquire `tables.list.lock`.
+2. Read `tables.list` to determine current reftables.
+3. Select `update_index` to be most recent file's
+`max_update_index + 1`.
+4. Prepare temp reftable `tmp_XXXXXX`, including log entries.
+5. Rename `tmp_XXXXXX` to `${update_index}-${update_index}-${random}.ref`.
+6. Copy `tables.list` to `tables.list.lock`, appending file from (5).
+7. Rename `tables.list.lock` to `tables.list`.
+
+During step 4 the new file's `min_update_index` and `max_update_index`
+are both set to the `update_index` selected by step 3. All log records
+for the transaction use the same `update_index` in their keys. This
+enables later correlation of which references were updated by the same
+transaction.
+
+Because a single `tables.list.lock` file is used to manage locking, the
+repository is single-threaded for writers. Writers may have to busy-spin
+(with backoff) around creating `tables.list.lock`, for up to an
+acceptable wait period, aborting if the repository is too busy to
+mutate. Application servers wrapped around repositories (e.g. Gerrit
+Code Review) can layer their own lock/wait queue to improve fairness to
+writers.
+
+Reference deletions
+^^^^^^^^^^^^^^^^^^^
+
+Deletion of any reference can be explicitly stored by setting the `type`
+to `0x0` and omitting the `value` field of the `ref_record`. This serves
+as a tombstone, overriding any assertions about the existence of the
+reference from earlier files in the stack.
+
+Compaction
+^^^^^^^^^^
+
+A partial stack of reftables can be compacted by merging references
+using a straightforward merge join across reftables, selecting the most
+recent value for output, and omitting deleted references that do not
+appear in remaining, lower reftables.
+
+A compacted reftable should set its `min_update_index` to the smallest
+of the input files' `min_update_index`, and its `max_update_index`
+likewise to the largest input `max_update_index`.
+
+For sake of illustration, assume the stack currently consists of
+reftable files (from oldest to newest): A, B, C, and D. The compactor is
+going to compact B and C, leaving A and D alone.
+
+1. Obtain lock `tables.list.lock` and read the `tables.list` file.
+2. Obtain locks `B.lock` and `C.lock`. Ownership of these locks
+prevents other processes from trying to compact these files.
+3. Release `tables.list.lock`.
+4. Compact `B` and `C` into a temp file
+`${min_update_index}-${max_update_index}_XXXXXX`.
+5. Reacquire lock `tables.list.lock`.
+6. Verify that `B` and `C` are still in the stack, in that order. This
+should always be the case, assuming that other processes are adhering to
+the locking protocol.
+7. Rename `${min_update_index}-${max_update_index}_XXXXXX` to
+`${min_update_index}-${max_update_index}-${random}.ref`.
+8. Write the new stack to `tables.list.lock`, replacing `B` and `C`
+with the file from (4).
+9. Rename `tables.list.lock` to `tables.list`.
+10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing
+readers to backtrack.
+
+This strategy permits compactions to proceed independently of updates.
+
+Each reftable (compacted or not) is uniquely identified by its name, so
+open reftables can be cached by their name.
+
+Windows
+^^^^^^^
+
+On windows, and other systems that do not allow deleting or renaming to open
+files, compaction may succeed, but other readers may prevent obsolete tables
+from being deleted.
+
+On these platforms, the following strategy can be followed: on closing a
+reftable stack, reload `tables.list`, and delete any tables no longer mentioned
+in `tables.list`.
+
+Irregular program exit may still leave about unused files. In this case, a
+cleanup operation should proceed as follows:
+
+* take a lock `tables.list.lock` to prevent concurrent modifications
+* refresh the reftable stack, by reading `tables.list`
+* for each `*.ref` file, remove it if
+** it is not mentioned in `tables.list`, and
+** its max update_index is not beyond the max update_index of the stack
+
+
+Alternatives considered
+~~~~~~~~~~~~~~~~~~~~~~~
+
+bzip packed-refs
+^^^^^^^^^^^^^^^^
+
+`bzip2` can significantly shrink a large packed-refs file (e.g. 62 MiB
+compresses to 23 MiB, 37%). However the bzip format does not support
+random access to a single reference. Readers must inflate and discard
+while performing a linear scan.
+
+Breaking packed-refs into chunks (individually compressing each chunk)
+would reduce the amount of data a reader must inflate, but still leaves
+the problem of indexing chunks to support readers efficiently locating
+the correct chunk.
+
+Given the compression achieved by reftable's encoding, it does not seem
+necessary to add the complexity of bzip/gzip/zlib.
+
+Michael Haggerty's alternate format
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Michael Haggerty proposed
+link:https://lore.kernel.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe%2BwGvEJ7A%40mail.gmail.com/[an
+alternate] format to reftable on the Git mailing list. This format uses
+smaller chunks, without the restart table, and avoids block alignment
+with padding. Reflog entries immediately follow each ref, and are thus
+interleaved between refs.
+
+Performance testing indicates reftable is faster for lookups (51%
+faster, 11.2 usec vs. 5.4 usec), although reftable produces a slightly
+larger file (+ ~3.2%, 28.3M vs 29.2M):
+
+[cols=">,>,>,>",options="header",]
+|=====================================
+|format |size |seek cold |seek hot
+|mh-alt |28.3 M |23.4 usec |11.2 usec
+|reftable |29.2 M |19.9 usec |5.4 usec
+|=====================================
+
+JGit Ketch RefTree
+^^^^^^^^^^^^^^^^^^
+
+https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html[JGit Ketch]
+proposed
+link:https://lore.kernel.org/git/CAJo%3DhJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg%40mail.gmail.com/[RefTree],
+an encoding of references inside Git tree objects stored as part of the
+repository's object database.
+
+The RefTree format adds additional load on the object database storage
+layer (more loose objects, more objects in packs), and relies heavily on
+the packer's delta compression to save space. Namespaces which are flat
+(e.g. thousands of tags in refs/tags) initially create very large loose
+objects, and so RefTree does not address the problem of copying many
+references to modify a handful.
+
+Flat namespaces are not efficiently searchable in RefTree, as tree
+objects in canonical formatting cannot be binary searched. This fails
+the need to handle a large number of references in a single namespace,
+such as GitHub's `refs/pulls`, or a project with many tags.
+
+LMDB
+^^^^
+
+David Turner proposed
+https://lore.kernel.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/[using
+LMDB], as LMDB is lightweight (64k of runtime code) and GPL-compatible
+license.
+
+A downside of LMDB is its reliance on a single C implementation. This
+makes embedding inside JGit (a popular reimplementation of Git)
+difficult, and hoisting onto virtual storage (for JGit DFS) virtually
+impossible.
+
+A common format that can be supported by all major Git implementations
+(git-core, JGit, libgit2) is strongly preferred.
diff --git a/Documentation/technical/remembering-renames.txt b/Documentation/technical/remembering-renames.txt
new file mode 100644
index 0000000..73f4176
--- /dev/null
+++ b/Documentation/technical/remembering-renames.txt
@@ -0,0 +1,671 @@
+Rebases and cherry-picks involve a sequence of merges whose results are
+recorded as new single-parent commits. The first parent side of those
+merges represent the "upstream" side, and often include a far larger set of
+changes than the second parent side. Traditionally, the renames on the
+first-parent side of that sequence of merges were repeatedly re-detected
+for every merge. This file explains why it is safe and effective during
+rebases and cherry-picks to remember renames on the upstream side of
+history as an optimization, assuming all merges are automatic and clean
+(i.e. no conflicts and not interrupted for user input or editing).
+
+Outline:
+
+ 0. Assumptions
+
+ 1. How rebasing and cherry-picking work
+
+ 2. Why the renames on MERGE_SIDE1 in any given pick are *always* a
+ superset of the renames on MERGE_SIDE1 for the next pick.
+
+ 3. Why any rename on MERGE_SIDE1 in any given pick is _almost_ always also
+ a rename on MERGE_SIDE1 for the next pick
+
+ 4. A detailed description of the counter-examples to #3.
+
+ 5. Why the special cases in #4 are still fully reasonable to use to pair
+ up files for three-way content merging in the merge machinery, and why
+ they do not affect the correctness of the merge.
+
+ 6. Interaction with skipping of "irrelevant" renames
+
+ 7. Additional items that need to be cached
+
+ 8. How directory rename detection interacts with the above and why this
+ optimization is still safe even if merge.directoryRenames is set to
+ "true".
+
+
+=== 0. Assumptions ===
+
+There are two assumptions that will hold throughout this document:
+
+ * The upstream side where commits are transplanted to is treated as the
+ first parent side when rebase/cherry-pick call the merge machinery
+
+ * All merges are fully automatic
+
+and a third that will hold in sections 2-5 for simplicity, that I'll later
+address in section 8:
+
+ * No directory renames occur
+
+
+Let me explain more about each assumption and why I include it:
+
+
+The first assumption is merely for the purposes of making this document
+clearer; the optimization implementation does not actually depend upon it.
+However, the assumption does hold in all cases because it reflects the way
+that both rebase and cherry-pick were implemented; and the implementation
+of cherry-pick and rebase are not readily changeable for backwards
+compatibility reasons (see for example the discussion of the --ours and
+--theirs flag in the documentation of `git checkout`, particularly the
+comments about how they behave with rebase). The optimization avoids
+checking first-parent-ness, though. It checks the conditions that make the
+optimization valid instead, so it would still continue working if someone
+changed the parent ordering that cherry-pick and rebase use. But making
+this assumption does make this document much clearer and prevents me from
+having to repeat every example twice.
+
+If the second assumption is violated, then the optimization simply is
+turned off and thus isn't relevant to consider. The second assumption can
+also be stated as "there is no interruption for a user to resolve conflicts
+or to just further edit or tweak files". While real rebases and
+cherry-picks are often interrupted (either because it's an interactive
+rebase where the user requested to stop and edit, or because there were
+conflicts that the user needs to resolve), the cache of renames is not
+stored on disk, and thus is thrown away as soon as the rebase or cherry
+pick stops for the user to resolve the operation.
+
+The third assumption makes sections 2-5 simpler, and allows people to
+understand the basics of why this optimization is safe and effective, and
+then I can go back and address the specifics in section 8. It is probably
+also worth noting that if directory renames do occur, then the default of
+merge.directoryRenames being set to "conflict" means that the operation
+will stop for users to resolve the conflicts and the cache will be thrown
+away, and thus that there won't be an optimization to apply. So, the only
+reason we need to address directory renames specifically, is that some
+users will have set merge.directoryRenames to "true" to allow the merges to
+continue to proceed automatically. The optimization is still safe with
+this config setting, but we have to discuss a few more cases to show why;
+this discussion is deferred until section 8.
+
+
+=== 1. How rebasing and cherry-picking work ===
+
+Consider the following setup (from the git-rebase manpage):
+
+ A---B---C topic
+ /
+ D---E---F---G main
+
+After rebasing or cherry-picking topic onto main, this will appear as:
+
+ A'--B'--C' topic
+ /
+ D---E---F---G main
+
+The way the commits A', B', and C' are created is through a series of
+merges, where rebase or cherry-pick sequentially uses each of the three
+A-B-C commits in a special merge operation. Let's label the three commits
+in the merge operation as MERGE_BASE, MERGE_SIDE1, and MERGE_SIDE2. For
+this picture, the three commits for each of the three merges would be:
+
+To create A':
+ MERGE_BASE: E
+ MERGE_SIDE1: G
+ MERGE_SIDE2: A
+
+To create B':
+ MERGE_BASE: A
+ MERGE_SIDE1: A'
+ MERGE_SIDE2: B
+
+To create C':
+ MERGE_BASE: B
+ MERGE_SIDE1: B'
+ MERGE_SIDE2: C
+
+Sometimes, folks are surprised that these three-way merges are done. It
+can be useful in understanding these three-way merges to view them in a
+slightly different light. For example, in creating C', you can view it as
+either:
+
+ * Apply the changes between B & C to B'
+ * Apply the changes between B & B' to C
+
+Conceptually the two statements above are the same as a three-way merge of
+B, B', and C, at least the parts before you decide to record a commit.
+
+
+=== 2. Why the renames on MERGE_SIDE1 in any given pick are always a ===
+=== superset of the renames on MERGE_SIDE1 for the next pick. ===
+
+The merge machinery uses the filenames it is fed from MERGE_BASE,
+MERGE_SIDE1, and MERGE_SIDE2. It will only move content to a different
+filename under one of three conditions:
+
+ * To make both pieces of a conflict available to a user during conflict
+ resolution (examples: directory/file conflict, add/add type conflict
+ such as symlink vs. regular file)
+
+ * When MERGE_SIDE1 renames the file.
+
+ * When MERGE_SIDE2 renames the file.
+
+First, let's remember what commits are involved in the first and second
+picks of the cherry-pick or rebase sequence:
+
+To create A':
+ MERGE_BASE: E
+ MERGE_SIDE1: G
+ MERGE_SIDE2: A
+
+To create B':
+ MERGE_BASE: A
+ MERGE_SIDE1: A'
+ MERGE_SIDE2: B
+
+So, in particular, we need to show that the renames between E and G are a
+superset of those between A and A'.
+
+A' is created by the first merge. A' will only have renames for one of the
+three reasons listed above. The first case, a conflict, results in a
+situation where the cache is dropped and thus this optimization doesn't
+take effect, so we need not consider that case. The third case, a rename
+on MERGE_SIDE2 (i.e. from G to A), will show up in A' but it also shows up
+in A -- therefore when diffing A and A' that path does not show up as a
+rename. The only remaining way for renames to show up in A' is for the
+rename to come from MERGE_SIDE1. Therefore, all renames between A and A'
+are a subset of those between E and G. Equivalently, all renames between E
+and G are a superset of those between A and A'.
+
+
+=== 3. Why any rename on MERGE_SIDE1 in any given pick is _almost_ ===
+=== always also a rename on MERGE_SIDE1 for the next pick. ===
+
+Let's again look at the first two picks:
+
+To create A':
+ MERGE_BASE: E
+ MERGE_SIDE1: G
+ MERGE_SIDE2: A
+
+To create B':
+ MERGE_BASE: A
+ MERGE_SIDE1: A'
+ MERGE_SIDE2: B
+
+Now let's look at any given rename from MERGE_SIDE1 of the first pick, i.e.
+any given rename from E to G. Let's use the filenames 'oldfile' and
+'newfile' for demonstration purposes. That first pick will function as
+follows; when the rename is detected, the merge machinery will do a
+three-way content merge of the following:
+ E:oldfile
+ G:newfile
+ A:oldfile
+and produce a new result:
+ A':newfile
+
+Note above that I've assumed that E->A did not rename oldfile. If that
+side did rename, then we most likely have a rename/rename(1to2) conflict
+that will cause the rebase or cherry-pick operation to halt and drop the
+in-memory cache of renames and thus doesn't need to be considered further.
+In the special case that E->A does rename the file but also renames it to
+newfile, then there is no conflict from the renaming and the merge can
+succeed. In this special case, the rename is not valid to cache because
+the second merge will find A:newfile in the MERGE_BASE (see also the new
+testcases in t6429 with "rename same file identically" in their
+description). So a rename/rename(1to1) needs to be specially handled by
+pruning renames from the cache and decrementing the dir_rename_counts in
+the current and leading directories associated with those renames. Or,
+since these are really rare, one could just take the easy way out and
+disable the remembering renames optimization when a rename/rename(1to1)
+happens.
+
+The previous paragraph handled the cases for E->A renaming oldfile, let's
+continue assuming that oldfile is not renamed in A.
+
+As per the diagram for creating B', MERGE_SIDE1 involves the changes from A
+to A'. So, we are curious whether A:oldfile and A':newfile will be viewed
+as renames. Note that:
+
+ * There will be no A':oldfile (because there could not have been a
+ G:oldfile as we do not do break detection in the merge machinery and
+ G:newfile was detected as a rename, and by the construction of the
+ rename above that merged cleanly, the merge machinery will ensure there
+ is no 'oldfile' in the result).
+
+ * There will be no A:newfile (if there had been, we would have had a
+ rename/add conflict).
+
+ * Clearly A:oldfile and A':newfile are "related" (A':newfile came from a
+ clean three-way content merge involving A:oldfile).
+
+We can also expound on the third point above, by noting that three-way
+content merges can also be viewed as applying the differences between the
+base and one side to the other side. Thus we can view A':newfile as
+having been created by taking the changes between E:oldfile and G:newfile
+(which were detected as being related, i.e. <50% changed) to A:oldfile.
+
+Thus A:oldfile and A':newfile are just as related as E:oldfile and
+G:newfile are -- they have exactly identical differences. Since the latter
+were detected as renames, A:oldfile and A':newfile should also be
+detectable as renames almost always.
+
+
+=== 4. A detailed description of the counter-examples to #3. ===
+
+We already noted in section 3 that rename/rename(1to1) (i.e. both sides
+renaming a file the same way) was one counter-example. The more
+interesting bit, though, is why did we need to use the "almost" qualifier
+when stating that A:oldfile and A':newfile are "almost" always detectable
+as renames?
+
+Let's repeat an earlier point that section 3 made:
+
+ A':newfile was created by applying the changes between E:oldfile and
+ G:newfile to A:oldfile. The changes between E:oldfile and G:newfile were
+ <50% of the size of E:oldfile.
+
+If those changes that were <50% of the size of E:oldfile are also <50% of
+the size of A:oldfile, then A:oldfile and A':newfile will be detectable as
+renames. However, if there is a dramatic size reduction between E:oldfile
+and A:oldfile (but the changes between E:oldfile, G:newfile, and A:oldfile
+still somehow merge cleanly), then traditional rename detection would not
+detect A:oldfile and A':newfile as renames.
+
+Here's an example where that can happen:
+ * E:oldfile had 20 lines
+ * G:newfile added 10 new lines at the beginning of the file
+ * A:oldfile kept the first 3 lines of the file, and deleted all the rest
+then
+ => A':newfile would have 13 lines, 3 of which matches those in A:oldfile.
+E:oldfile -> G:newfile would be detected as a rename, but A:oldfile and
+A':newfile would not be.
+
+
+=== 5. Why the special cases in #4 are still fully reasonable to use to ===
+=== pair up files for three-way content merging in the merge machinery, ===
+=== and why they do not affect the correctness of the merge. ===
+
+In the rename/rename(1to1) case, A:newfile and A':newfile are not renames
+since they use the *same* filename. However, files with the same filename
+are obviously fine to pair up for three-way content merging (the merge
+machinery has never employed break detection). The interesting
+counter-example case is thus not the rename/rename(1to1) case, but the case
+where A did not rename oldfile. That was the case that we spent most of
+the time discussing in sections 3 and 4. The remainder of this section
+will be devoted to that case as well.
+
+So, even if A:oldfile and A':newfile aren't detectable as renames, why is
+it still reasonable to pair them up for three-way content merging in the
+merge machinery? There are multiple reasons:
+
+ * As noted in sections 3 and 4, the diff between A:oldfile and A':newfile
+ is *exactly* the same as the diff between E:oldfile and G:newfile. The
+ latter pair were detected as renames, so it seems unlikely to surprise
+ users for us to treat A:oldfile and A':newfile as renames.
+
+ * In fact, "oldfile" and "newfile" were at one point detected as renames
+ due to how they were constructed in the E..G chain. And we used that
+ information once already in this rebase/cherry-pick. I think users
+ would be unlikely to be surprised at us continuing to treat the files
+ as renames and would quickly understand why we had done so.
+
+ * Marking or declaring files as renames is *not* the end goal for merges.
+ Merges use renames to determine which files make sense to be paired up
+ for three-way content merges.
+
+ * A:oldfile and A':newfile were _already_ paired up in a three-way
+ content merge; that is how A':newfile was created. In fact, that
+ three-way content merge was clean. So using them again in a later
+ three-way content merge seems very reasonable.
+
+However, the above is focusing on the common scenarios. Let's try to look
+at all possible unusual scenarios and compare without the optimization to
+with the optimization. Consider the following theoretical cases; we will
+then dive into each to determine which of them are possible,
+and if so, what they mean:
+
+ 1. Without the optimization, the second merge results in a conflict.
+ With the optimization, the second merge also results in a conflict.
+ Questions: Are the conflicts confusingly different? Better in one case?
+
+ 2. Without the optimization, the second merge results in NO conflict.
+ With the optimization, the second merge also results in NO conflict.
+ Questions: Are the merges the same?
+
+ 3. Without the optimization, the second merge results in a conflict.
+ With the optimization, the second merge results in NO conflict.
+ Questions: Possible? Bug, bugfix, or something else?
+
+ 4. Without the optimization, the second merge results in NO conflict.
+ With the optimization, the second merge results in a conflict.
+ Questions: Possible? Bug, bugfix, or something else?
+
+I'll consider all four cases, but out of order.
+
+The fourth case is impossible. For the code without the remembering
+renames optimization to not get a conflict, B:oldfile would need to exactly
+match A:oldfile -- if it doesn't, there would be a modify/delete conflict.
+If A:oldfile matches B:oldfile exactly, then a three-way content merge
+between A:oldfile, A':newfile, and B:oldfile would have no conflict and
+just give us the version of newfile from A' as the result.
+
+From the same logic as the above paragraph, the second case would indeed
+result in identical merges. When A:oldfile exactly matches B:oldfile, an
+undetected rename would say, "Oh, I see one side didn't modify 'oldfile'
+and the other side deleted it. I'll delete it. And I see you have this
+brand new file named 'newfile' in A', so I'll keep it." That gives the
+same results as three-way content merging A:oldfile, A':newfile, and
+B:oldfile -- a removal of oldfile with the version of newfile from A'
+showing up in the result.
+
+The third case is interesting. It means that A:oldfile and A':newfile were
+not just similar enough, but that the changes between them did not conflict
+with the changes between A:oldfile and B:oldfile. This would validate our
+hunch that the files were similar enough to be used in a three-way content
+merge, and thus seems entirely correct for us to have used them that way.
+(Sidenote: One particular example here may be enlightening. Let's say that
+B was an immediate revert of A. B clearly would have been a clean revert
+of A, since A was B's immediate parent. One would assume that if you can
+pick a commit, you should also be able to cherry-pick its immediate revert.
+However, this is one of those funny corner cases; without this
+optimization, we just successfully picked a commit cleanly, but we are
+unable to cherry-pick its immediate revert due to the size differences
+between E:oldfile and A:oldfile.)
+
+That leaves only the first case to consider -- when we get conflicts both
+with or without the optimization. Without the optimization, we'll have a
+modify/delete conflict, where both A':newfile and B:oldfile are left in the
+tree for the user to deal with and no hints about the potential similarity
+between the two. With the optimization, we'll have a three-way content
+merged A:oldfile, A':newfile, and B:oldfile with conflict markers
+suggesting we thought the files were related but giving the user the chance
+to resolve. As noted above, I don't think users will find us treating
+'oldfile' and 'newfile' as related as a surprise since they were between E
+and G. In any event, though, this case shouldn't be concerning since we
+hit a conflict in both cases, told the user what we know, and asked them to
+resolve it.
+
+So, in summary, case 4 is impossible, case 2 yields the same behavior, and
+cases 1 and 3 seem to provide as good or better behavior with the
+optimization than without.
+
+
+=== 6. Interaction with skipping of "irrelevant" renames ===
+
+Previous optimizations involved skipping rename detection for paths
+considered to be "irrelevant". See for example the following commits:
+
+ * 32a56dfb99 ("merge-ort: precompute subset of sources for which we
+ need rename detection", 2021-03-11)
+ * 2fd9eda462 ("merge-ort: precompute whether directory rename
+ detection is needed", 2021-03-11)
+ * 9bd342137e ("diffcore-rename: determine which relevant_sources are
+ no longer relevant", 2021-03-13)
+
+Relevance is always determined by what the _other_ side of history has
+done, in terms of modifying a file that our side renamed, or adding a
+file to a directory which our side renamed. This means that a path
+that is "irrelevant" when picking the first commit of a series in a
+rebase or cherry-pick, may suddenly become "relevant" when picking the
+next commit.
+
+The upshot of this is that we can only cache rename detection results
+for relevant paths, and need to re-check relevance in subsequent
+commits. If those subsequent commits have additional paths that are
+relevant for rename detection, then we will need to redo rename
+detection -- though we can limit it to the paths for which we have not
+already detected renames.
+
+
+=== 7. Additional items that need to be cached ===
+
+It turns out we have to cache more than just renames; we also cache:
+
+ A) non-renames (i.e. unpaired deletes)
+ B) counts of renames within directories
+ C) sources that were marked as RELEVANT_LOCATION, but which were
+ downgraded to RELEVANT_NO_MORE
+ D) the toplevel trees involved in the merge
+
+These are all stored in struct rename_info, and respectively appear in
+ * cached_pairs (along side actual renames, just with a value of NULL)
+ * dir_rename_counts
+ * cached_irrelevant
+ * merge_trees
+
+The reason for (A) comes from the irrelevant renames skipping
+optimization discussed in section 6. The fact that irrelevant renames
+are skipped means we only get a subset of the potential renames
+detected and subsequent commits may need to run rename detection on
+the upstream side on a subset of the remaining renames (to get the
+renames that are relevant for that later commit). Since unpaired
+deletes are involved in rename detection too, we don't want to
+repeatedly check that those paths remain unpaired on the upstream side
+with every commit we are transplanting.
+
+The reason for (B) is that diffcore_rename_extended() is what
+generates the counts of renames by directory which is needed in
+directory rename detection, and if we don't run
+diffcore_rename_extended() again then we need to have the output from
+it, including dir_rename_counts, from the previous run.
+
+The reason for (C) is that merge-ort's tree traversal will again think
+those paths are relevant (marking them as RELEVANT_LOCATION), but the
+fact that they were downgraded to RELEVANT_NO_MORE means that
+dir_rename_counts already has the information we need for directory
+rename detection. (A path which becomes RELEVANT_CONTENT in a
+subsequent commit will be removed from cached_irrelevant.)
+
+The reason for (D) is that is how we determine whether the remember
+renames optimization can be used. In particular, remembering that our
+sequence of merges looks like:
+
+ Merge 1:
+ MERGE_BASE: E
+ MERGE_SIDE1: G
+ MERGE_SIDE2: A
+ => Creates A'
+
+ Merge 2:
+ MERGE_BASE: A
+ MERGE_SIDE1: A'
+ MERGE_SIDE2: B
+ => Creates B'
+
+It is the fact that the trees A and A' appear both in Merge 1 and in
+Merge 2, with A as a parent of A' that allows this optimization. So
+we store the trees to compare with what we are asked to merge next
+time.
+
+
+=== 8. How directory rename detection interacts with the above and ===
+=== why this optimization is still safe even if ===
+=== merge.directoryRenames is set to "true". ===
+
+As noted in the assumptions section:
+
+ """
+ ...if directory renames do occur, then the default of
+ merge.directoryRenames being set to "conflict" means that the operation
+ will stop for users to resolve the conflicts and the cache will be
+ thrown away, and thus that there won't be an optimization to apply.
+ So, the only reason we need to address directory renames specifically,
+ is that some users will have set merge.directoryRenames to "true" to
+ allow the merges to continue to proceed automatically.
+ """
+
+Let's remember that we need to look at how any given pick affects the next
+one. So let's again use the first two picks from the diagram in section
+one:
+
+ First pick does this three-way merge:
+ MERGE_BASE: E
+ MERGE_SIDE1: G
+ MERGE_SIDE2: A
+ => creates A'
+
+ Second pick does this three-way merge:
+ MERGE_BASE: A
+ MERGE_SIDE1: A'
+ MERGE_SIDE2: B
+ => creates B'
+
+Now, directory rename detection exists so that if one side of history
+renames a directory, and the other side adds a new file to the old
+directory, then the merge (with merge.directoryRenames=true) can move the
+file into the new directory. There are two qualitatively different ways to
+add a new file to an old directory: create a new file, or rename a file
+into that directory. Also, directory renames can be done on either side of
+history, so there are four cases to consider:
+
+ * MERGE_SIDE1 renames old dir, MERGE_SIDE2 adds new file to old dir
+ * MERGE_SIDE1 renames old dir, MERGE_SIDE2 renames file into old dir
+ * MERGE_SIDE1 adds new file to old dir, MERGE_SIDE2 renames old dir
+ * MERGE_SIDE1 renames file into old dir, MERGE_SIDE2 renames old dir
+
+One last note before we consider these four cases: There are some
+important properties about how we implement this optimization with
+respect to directory rename detection that we need to bear in mind
+while considering all of these cases:
+
+ * rename caching occurs *after* applying directory renames
+
+ * a rename created by directory rename detection is recorded for the side
+ of history that did the directory rename.
+
+ * dir_rename_counts, the nested map of
+ {oldname => {newname => count}},
+ is cached between runs as well. This basically means that directory
+ rename detection is also cached, though only on the side of history
+ that we cache renames for (MERGE_SIDE1 as far as this document is
+ concerned; see the assumptions section). Two interesting sub-notes
+ about these counts:
+
+ * If we need to perform rename-detection again on the given side (e.g.
+ some paths are relevant for rename detection that weren't before),
+ then we clear dir_rename_counts and recompute it, making use of
+ cached_pairs. The reason it is important to do this is optimizations
+ around RELEVANT_LOCATION exist to prevent us from computing
+ unnecessary renames for directory rename detection and from computing
+ dir_rename_counts for irrelevant directories; but those same renames
+ or directories may become necessary for subsequent merges. The
+ easiest way to "fix up" dir_rename_counts in such cases is to just
+ recompute it.
+
+ * If we prune rename/rename(1to1) entries from the cache, then we also
+ need to update dir_rename_counts to decrement the counts for the
+ involved directory and any relevant parent directories (to undo what
+ update_dir_rename_counts() in diffcore-rename.c incremented when the
+ rename was initially found). If we instead just disable the
+ remembering renames optimization when the exceedingly rare
+ rename/rename(1to1) cases occur, then dir_rename_counts will get
+ re-computed the next time rename detection occurs, as noted above.
+
+ * the side with multiple commits to pick, is the side of history that we
+ do NOT cache renames for. Thus, there are no additional commits to
+ change the number of renames in a directory, except for those done by
+ directory rename detection (which always pad the majority).
+
+ * the "renames" we cache are modified slightly by any directory rename,
+ as noted below.
+
+Now, with those notes out of the way, let's go through the four cases
+in order:
+
+Case 1: MERGE_SIDE1 renames old dir, MERGE_SIDE2 adds new file to old dir
+
+ This case looks like this:
+
+ MERGE_BASE: E, Has olddir/
+ MERGE_SIDE1: G, Renames olddir/ -> newdir/
+ MERGE_SIDE2: A, Adds olddir/newfile
+ => creates A', With newdir/newfile
+
+ MERGE_BASE: A, Has olddir/newfile
+ MERGE_SIDE1: A', Has newdir/newfile
+ MERGE_SIDE2: B, Modifies olddir/newfile
+ => expected B', with threeway-merged newdir/newfile from above
+
+ In this case, with the optimization, note that after the first commit:
+ * MERGE_SIDE1 remembers olddir/ -> newdir/
+ * MERGE_SIDE1 has cached olddir/newfile -> newdir/newfile
+ Given the cached rename noted above, the second merge can proceed as
+ expected without needing to perform rename detection from A -> A'.
+
+Case 2: MERGE_SIDE1 renames old dir, MERGE_SIDE2 renames file into old dir
+
+ This case looks like this:
+ MERGE_BASE: E oldfile, olddir/
+ MERGE_SIDE1: G oldfile, olddir/ -> newdir/
+ MERGE_SIDE2: A oldfile -> olddir/newfile
+ => creates A', With newdir/newfile representing original oldfile
+
+ MERGE_BASE: A olddir/newfile
+ MERGE_SIDE1: A' newdir/newfile
+ MERGE_SIDE2: B modify olddir/newfile
+ => expected B', with threeway-merged newdir/newfile from above
+
+ In this case, with the optimization, note that after the first commit:
+ * MERGE_SIDE1 remembers olddir/ -> newdir/
+ * MERGE_SIDE1 has cached olddir/newfile -> newdir/newfile
+ (NOT oldfile -> newdir/newfile; compare to case with
+ (p->status == 'R' && new_path) in possibly_cache_new_pair())
+
+ Given the cached rename noted above, the second merge can proceed as
+ expected without needing to perform rename detection from A -> A'.
+
+Case 3: MERGE_SIDE1 adds new file to old dir, MERGE_SIDE2 renames old dir
+
+ This case looks like this:
+
+ MERGE_BASE: E, Has olddir/
+ MERGE_SIDE1: G, Adds olddir/newfile
+ MERGE_SIDE2: A, Renames olddir/ -> newdir/
+ => creates A', With newdir/newfile
+
+ MERGE_BASE: A, Has newdir/, but no notion of newdir/newfile
+ MERGE_SIDE1: A', Has newdir/newfile
+ MERGE_SIDE2: B, Has newdir/, but no notion of newdir/newfile
+ => expected B', with newdir/newfile from A'
+
+ In this case, with the optimization, note that after the first commit there
+ were no renames on MERGE_SIDE1, and any renames on MERGE_SIDE2 are tossed.
+ But the second merge didn't need any renames so this is fine.
+
+Case 4: MERGE_SIDE1 renames file into old dir, MERGE_SIDE2 renames old dir
+
+ This case looks like this:
+
+ MERGE_BASE: E, Has olddir/
+ MERGE_SIDE1: G, Renames oldfile -> olddir/newfile
+ MERGE_SIDE2: A, Renames olddir/ -> newdir/
+ => creates A', With newdir/newfile representing original oldfile
+
+ MERGE_BASE: A, Has oldfile
+ MERGE_SIDE1: A', Has newdir/newfile
+ MERGE_SIDE2: B, Modifies oldfile
+ => expected B', with threeway-merged newdir/newfile from above
+
+ In this case, with the optimization, note that after the first commit:
+ * MERGE_SIDE1 remembers oldfile -> newdir/newfile
+ (NOT oldfile -> olddir/newfile; compare to case of second
+ block under p->status == 'R' in possibly_cache_new_pair())
+ * MERGE_SIDE2 renames are tossed because only MERGE_SIDE1 is remembered
+
+ Given the cached rename noted above, the second merge can proceed as
+ expected without needing to perform rename detection from A -> A'.
+
+Finally, I'll just note here that interactions with the
+skip-irrelevant-renames optimization means we sometimes don't detect
+renames for any files within a directory that was renamed, in which
+case we will not have been able to detect any rename for the directory
+itself. In such a case, we do not know whether the directory was
+renamed; we want to be careful to avoid caching some kind of "this
+directory was not renamed" statement. If we did, then a subsequent
+commit being rebased could add a file to the old directory, and the
+user would expect it to end up in the correct directory -- something
+our erroneous "this directory was not renamed" cache would preclude.
diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index 7844ef3..4728142 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -82,9 +82,9 @@ When the config key `extensions.preciousObjects` is set to `true`,
objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
`git repack -d`).
-==== `partialclone`
+==== `partialClone`
-When the config key `extensions.partialclone` is set, it indicates
+When the config key `extensions.partialClone` is set, it indicates
that the repo was created with a partial clone (or later performed
a partial fetch) and that the remote may have omitted sending
certain unwanted objects. Such a remote is called a "promisor remote"
@@ -96,7 +96,13 @@ The value of this key is the name of the promisor remote.
==== `worktreeConfig`
If set, by default "git config" reads from both "config" and
-"config.worktree" file from GIT_DIR in that order. In
+"config.worktree" files from GIT_DIR in that order. In
multiple working directory mode, "config" file is shared while
"config.worktree" is per-working directory (i.e., it's in
GIT_COMMON_DIR/worktrees/<id>/config.worktree)
+
+==== `refStorage`
+
+Specifies the file format for the ref database. The valid values are
+`files` (loose references with a packed-refs file) and `reftable` (see
+Documentation/technical/reftable.txt).
diff --git a/Documentation/technical/rerere.txt b/Documentation/technical/rerere.txt
index aa22d7a..580f233 100644
--- a/Documentation/technical/rerere.txt
+++ b/Documentation/technical/rerere.txt
@@ -14,9 +14,9 @@ conflicts before writing them to the rerere database.
Different conflict styles and branch names are normalized by stripping
the labels from the conflict markers, and removing the common ancestor
-version from the `diff3` conflict style. Branches that are merged
-in different order are normalized by sorting the conflict hunks. More
-on each of those steps in the following sections.
+version from the `diff3` or `zdiff3` conflict styles. Branches that
+are merged in different order are normalized by sorting the conflict
+hunks. More on each of those steps in the following sections.
Once these two normalization operations are applied, a conflict ID is
calculated based on the normalized conflict, which is later used by
@@ -42,8 +42,8 @@ get a conflict like the following:
>>>>>>> AC
Doing the analogous with AC2 (forking a branch ABAC2 off of branch AB
-and then merging branch AC2 into it), using the diff3 conflict style,
-we get a conflict like the following:
+and then merging branch AC2 into it), using the diff3 or zdiff3
+conflict style, we get a conflict like the following:
<<<<<<< HEAD
B
@@ -60,7 +60,7 @@ By resolving this conflict, to leave line D, the user declares:
what AB and AC wanted to do.
As branch AC2 refers to the same commit as AC, the above implies that
-this is also compatible what AB and AC2 wanted to do.
+this is also compatible with what AB and AC2 wanted to do.
By extension, this means that rerere should recognize that the above
conflicts are the same. To do this, the labels on the conflict
@@ -76,7 +76,7 @@ examples would both result in the following normalized conflict:
Sorting hunks
~~~~~~~~~~~~~
-As before, lets imagine that a common ancestor had a file with line A
+As before, let's imagine that a common ancestor had a file with line A
its early part, and line X in its late part. And then four branches
are forked that do these things:
@@ -99,7 +99,7 @@ conflict to leave line D means that the user declares:
compatible with what AB and AC wanted to do.
So the conflict we would see when merging AB into ACAB should be
-resolved the same way---it is the resolution that is in line with that
+resolved the same way--it is the resolution that is in line with that
declaration.
Imagine that similarly previously a branch XYXZ was forked from XY,
@@ -117,7 +117,7 @@ early A became C or B, a late X became Y or Z". We can see there are
4 combinations of ("B or C", "C or B") x ("X or Y", "Y or X").
By sorting, the conflict is given its canonical name, namely, "an
-early part became B or C, a late part becames X or Y", and whenever
+early part became B or C, a late part became X or Y", and whenever
any of these four patterns appear, and we can get to the same conflict
and resolution that we saw earlier.
@@ -145,7 +145,7 @@ Nested conflicts
Nested conflicts are handled very similarly to "simple" conflicts.
Similar to simple conflicts, the conflict is first normalized by
stripping the labels from conflict markers, stripping the common ancestor
-version, and the sorting the conflict hunks, both for the outer and the
+version, and sorting the conflict hunks, both for the outer and the
inner conflict. This is done recursively, so any number of nested
conflicts can be handled.
diff --git a/Documentation/technical/scalar.txt b/Documentation/technical/scalar.txt
new file mode 100644
index 0000000..921cb10
--- /dev/null
+++ b/Documentation/technical/scalar.txt
@@ -0,0 +1,66 @@
+Scalar
+======
+
+Scalar is a repository management tool that optimizes Git for use in large
+repositories. It accomplishes this by helping users to take advantage of
+advanced performance features in Git. Unlike most other Git built-in commands,
+Scalar is not executed as a subcommand of 'git'; rather, it is built as a
+separate executable containing its own series of subcommands.
+
+Background
+----------
+
+Scalar was originally designed as an add-on to Git and implemented as a .NET
+Core application. It was created based on the learnings from the VFS for Git
+project (another application aimed at improving the experience of working with
+large repositories). As part of its initial implementation, Scalar relied on
+custom features in the Microsoft fork of Git that have since been integrated
+into core Git:
+
+* partial clone,
+* commit graphs,
+* multi-pack index,
+* sparse checkout (cone mode),
+* scheduled background maintenance,
+* etc
+
+With the requisite Git functionality in place and a desire to bring the benefits
+of Scalar to the larger Git community, the Scalar application itself was ported
+from C# to C and integrated upstream.
+
+Features
+--------
+
+Scalar is comprised of two major pieces of functionality: automatically
+configuring built-in Git performance features and managing repository
+enlistments.
+
+The Git performance features configured by Scalar (see "Background" for
+examples) confer substantial performance benefits to large repositories, but are
+either too experimental to enable for all of Git yet, or only benefit large
+repositories. As new features are introduced, Scalar should be updated
+accordingly to incorporate them. This will prevent the tool from becoming stale
+while also providing a path for more easily bringing features to the appropriate
+users.
+
+Enlistments are how Scalar knows which repositories on a user's system should
+utilize Scalar-configured features. This allows it to update performance
+settings when new ones are added to the tool, as well as centrally manage
+repository maintenance. The enlistment structure - a root directory with a
+`src/` subdirectory containing the cloned repository itself - is designed to
+encourage users to route build outputs outside of the repository to avoid the
+performance-limiting overhead of ignoring those files in Git.
+
+Design
+------
+
+Scalar is implemented in C and interacts with Git via a mix of child process
+invocations of Git and direct usage of `libgit.a`. Internally, it is structured
+much like other built-ins with subcommands (e.g., `git stash`), containing a
+`cmd_<subcommand>()` function for each subcommand, routed through a `cmd_main()`
+function. Most options are unique to each subcommand, with `scalar` respecting
+some "global" `git` options (e.g., `-c` and `-C`).
+
+Because `scalar` is not invoked as a Git subcommand (like `git scalar`), it is
+built and installed as its own executable in the `bin/` directory, alongside
+`git`, `git-gui`, etc.
diff --git a/Documentation/technical/shallow.txt b/Documentation/technical/shallow.txt
index 01dedfe..f3738ba 100644
--- a/Documentation/technical/shallow.txt
+++ b/Documentation/technical/shallow.txt
@@ -13,7 +13,7 @@ pretend as if they are root commits (e.g. "git log" traversal
stops after showing them; "git fsck" does not complain saying
the commits listed on their "parent" lines do not exist).
-Each line contains exactly one SHA-1. When read, a commit_graft
+Each line contains exactly one object name. When read, a commit_graft
will be constructed, which has nr_parent < 0 to make it easier
to discern from user provided grafts.
diff --git a/Documentation/technical/signature-format.txt b/Documentation/technical/signature-format.txt
deleted file mode 100644
index 2c9406a..0000000
--- a/Documentation/technical/signature-format.txt
+++ /dev/null
@@ -1,186 +0,0 @@
-Git signature format
-====================
-
-== Overview
-
-Git uses cryptographic signatures in various places, currently objects (tags,
-commits, mergetags) and transactions (pushes). In every case, the command which
-is about to create an object or transaction determines a payload from that,
-calls gpg to obtain a detached signature for the payload (`gpg -bsa`) and
-embeds the signature into the object or transaction.
-
-Signatures always begin with `-----BEGIN PGP SIGNATURE-----`
-and end with `-----END PGP SIGNATURE-----`, unless gpg is told to
-produce RFC1991 signatures which use `MESSAGE` instead of `SIGNATURE`.
-
-The signed payload and the way the signature is embedded depends
-on the type of the object resp. transaction.
-
-== Tag signatures
-
-- created by: `git tag -s`
-- payload: annotated tag object
-- embedding: append the signature to the unsigned tag object
-- example: tag `signedtag` with subject `signed tag`
-
-----
-object 04b871796dc0420f8e7561a895b52484b701d51a
-type commit
-tag signedtag
-tagger C O Mitter <committer@example.com> 1465981006 +0000
-
-signed tag
-
-signed tag message body
------BEGIN PGP SIGNATURE-----
-Version: GnuPG v1
-
-iQEcBAABAgAGBQJXYRhOAAoJEGEJLoW3InGJklkIAIcnhL7RwEb/+QeX9enkXhxn
-rxfdqrvWd1K80sl2TOt8Bg/NYwrUBw/RWJ+sg/hhHp4WtvE1HDGHlkEz3y11Lkuh
-8tSxS3qKTxXUGozyPGuE90sJfExhZlW4knIQ1wt/yWqM+33E9pN4hzPqLwyrdods
-q8FWEqPPUbSJXoMbRPw04S5jrLtZSsUWbRYjmJCHzlhSfFWW4eFd37uquIaLUBS0
-rkC3Jrx7420jkIpgFcTI2s60uhSQLzgcCwdA2ukSYIRnjg/zDkj8+3h/GaROJ72x
-lZyI6HWixKJkWw8lE9aAOD9TmTW9sFJwcVAzmAuFX2kUreDUKMZduGcoRYGpD7E=
-=jpXa
------END PGP SIGNATURE-----
-----
-
-- verify with: `git verify-tag [-v]` or `git tag -v`
-
-----
-gpg: Signature made Wed Jun 15 10:56:46 2016 CEST using RSA key ID B7227189
-gpg: Good signature from "Eris Discordia <discord@example.net>"
-gpg: WARNING: This key is not certified with a trusted signature!
-gpg: There is no indication that the signature belongs to the owner.
-Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189
-object 04b871796dc0420f8e7561a895b52484b701d51a
-type commit
-tag signedtag
-tagger C O Mitter <committer@example.com> 1465981006 +0000
-
-signed tag
-
-signed tag message body
-----
-
-== Commit signatures
-
-- created by: `git commit -S`
-- payload: commit object
-- embedding: header entry `gpgsig`
- (content is preceded by a space)
-- example: commit with subject `signed commit`
-
-----
-tree eebfed94e75e7760540d1485c740902590a00332
-parent 04b871796dc0420f8e7561a895b52484b701d51a
-author A U Thor <author@example.com> 1465981137 +0000
-committer C O Mitter <committer@example.com> 1465981137 +0000
-gpgsig -----BEGIN PGP SIGNATURE-----
- Version: GnuPG v1
-
- iQEcBAABAgAGBQJXYRjRAAoJEGEJLoW3InGJ3IwIAIY4SA6GxY3BjL60YyvsJPh/
- HRCJwH+w7wt3Yc/9/bW2F+gF72kdHOOs2jfv+OZhq0q4OAN6fvVSczISY/82LpS7
- DVdMQj2/YcHDT4xrDNBnXnviDO9G7am/9OE77kEbXrp7QPxvhjkicHNwy2rEflAA
- zn075rtEERDHr8nRYiDh8eVrefSO7D+bdQ7gv+7GsYMsd2auJWi1dHOSfTr9HIF4
- HJhWXT9d2f8W+diRYXGh4X0wYiGg6na/soXc+vdtDYBzIxanRqjg8jCAeo1eOTk1
- EdTwhcTZlI0x5pvJ3H0+4hA2jtldVtmPM4OTB0cTrEWBad7XV6YgiyuII73Ve3I=
- =jKHM
- -----END PGP SIGNATURE-----
-
-signed commit
-
-signed commit message body
-----
-
-- verify with: `git verify-commit [-v]` (or `git show --show-signature`)
-
-----
-gpg: Signature made Wed Jun 15 10:58:57 2016 CEST using RSA key ID B7227189
-gpg: Good signature from "Eris Discordia <discord@example.net>"
-gpg: WARNING: This key is not certified with a trusted signature!
-gpg: There is no indication that the signature belongs to the owner.
-Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189
-tree eebfed94e75e7760540d1485c740902590a00332
-parent 04b871796dc0420f8e7561a895b52484b701d51a
-author A U Thor <author@example.com> 1465981137 +0000
-committer C O Mitter <committer@example.com> 1465981137 +0000
-
-signed commit
-
-signed commit message body
-----
-
-== Mergetag signatures
-
-- created by: `git merge` on signed tag
-- payload/embedding: the whole signed tag object is embedded into
- the (merge) commit object as header entry `mergetag`
-- example: merge of the signed tag `signedtag` as above
-
-----
-tree c7b1cff039a93f3600a1d18b82d26688668c7dea
-parent c33429be94b5f2d3ee9b0adad223f877f174b05d
-parent 04b871796dc0420f8e7561a895b52484b701d51a
-author A U Thor <author@example.com> 1465982009 +0000
-committer C O Mitter <committer@example.com> 1465982009 +0000
-mergetag object 04b871796dc0420f8e7561a895b52484b701d51a
- type commit
- tag signedtag
- tagger C O Mitter <committer@example.com> 1465981006 +0000
-
- signed tag
-
- signed tag message body
- -----BEGIN PGP SIGNATURE-----
- Version: GnuPG v1
-
- iQEcBAABAgAGBQJXYRhOAAoJEGEJLoW3InGJklkIAIcnhL7RwEb/+QeX9enkXhxn
- rxfdqrvWd1K80sl2TOt8Bg/NYwrUBw/RWJ+sg/hhHp4WtvE1HDGHlkEz3y11Lkuh
- 8tSxS3qKTxXUGozyPGuE90sJfExhZlW4knIQ1wt/yWqM+33E9pN4hzPqLwyrdods
- q8FWEqPPUbSJXoMbRPw04S5jrLtZSsUWbRYjmJCHzlhSfFWW4eFd37uquIaLUBS0
- rkC3Jrx7420jkIpgFcTI2s60uhSQLzgcCwdA2ukSYIRnjg/zDkj8+3h/GaROJ72x
- lZyI6HWixKJkWw8lE9aAOD9TmTW9sFJwcVAzmAuFX2kUreDUKMZduGcoRYGpD7E=
- =jpXa
- -----END PGP SIGNATURE-----
-
-Merge tag 'signedtag' into downstream
-
-signed tag
-
-signed tag message body
-
-# gpg: Signature made Wed Jun 15 08:56:46 2016 UTC using RSA key ID B7227189
-# gpg: Good signature from "Eris Discordia <discord@example.net>"
-# gpg: WARNING: This key is not certified with a trusted signature!
-# gpg: There is no indication that the signature belongs to the owner.
-# Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189
-----
-
-- verify with: verification is embedded in merge commit message by default,
- alternatively with `git show --show-signature`:
-
-----
-commit 9863f0c76ff78712b6800e199a46aa56afbcbd49
-merged tag 'signedtag'
-gpg: Signature made Wed Jun 15 10:56:46 2016 CEST using RSA key ID B7227189
-gpg: Good signature from "Eris Discordia <discord@example.net>"
-gpg: WARNING: This key is not certified with a trusted signature!
-gpg: There is no indication that the signature belongs to the owner.
-Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189
-Merge: c33429b 04b8717
-Author: A U Thor <author@example.com>
-Date: Wed Jun 15 09:13:29 2016 +0000
-
- Merge tag 'signedtag' into downstream
-
- signed tag
-
- signed tag message body
-
- # gpg: Signature made Wed Jun 15 08:56:46 2016 UTC using RSA key ID B7227189
- # gpg: Good signature from "Eris Discordia <discord@example.net>"
- # gpg: WARNING: This key is not certified with a trusted signature!
- # gpg: There is no indication that the signature belongs to the owner.
- # Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189
-----
diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
new file mode 100644
index 0000000..fa0d01c
--- /dev/null
+++ b/Documentation/technical/sparse-checkout.txt
@@ -0,0 +1,1103 @@
+Table of contents:
+
+ * Terminology
+ * Purpose of sparse-checkouts
+ * Usecases of primary concern
+ * Oversimplified mental models ("Cliff Notes" for this document!)
+ * Desired behavior
+ * Behavior classes
+ * Subcommand-dependent defaults
+ * Sparse specification vs. sparsity patterns
+ * Implementation Questions
+ * Implementation Goals/Plans
+ * Known bugs
+ * Reference Emails
+
+
+=== Terminology ===
+
+cone mode: one of two modes for specifying the desired subset of files
+ in a sparse-checkout. In cone-mode, the user specifies
+ directories (getting both everything under that directory as
+ well as everything in leading directories), while in non-cone
+ mode, the user specifies gitignore-style patterns. Controlled
+ by the --[no-]cone option to sparse-checkout init|set.
+
+SKIP_WORKTREE: When tracked files do not match the sparse specification and
+ are removed from the working tree, the file in the index is marked
+ with a SKIP_WORKTREE bit. Note that if a tracked file has the
+ SKIP_WORKTREE bit set but the file is later written by the user to
+ the working tree anyway, the SKIP_WORKTREE bit will be cleared at
+ the beginning of any subsequent Git operation.
+
+ Most sparse checkout users are unaware of this implementation
+ detail, and the term should generally be avoided in user-facing
+ descriptions and command flags. Unfortunately, prior to the
+ `sparse-checkout` subcommand this low-level detail was exposed,
+ and as of time of writing, is still exposed in various places.
+
+sparse-checkout: a subcommand in git used to reduce the files present in
+ the working tree to a subset of all tracked files. Also, the
+ name of the file in the $GIT_DIR/info directory used to track
+ the sparsity patterns corresponding to the user's desired
+ subset.
+
+sparse cone: see cone mode
+
+sparse directory: An entry in the index corresponding to a directory, which
+ appears in the index instead of all the files under that directory
+ that would normally appear. See also sparse-index. Something that
+ can cause confusion is that the "sparse directory" does NOT match
+ the sparse specification, i.e. the directory is NOT present in the
+ working tree. May be renamed in the future (e.g. to "skipped
+ directory").
+
+sparse index: A special mode for sparse-checkout that also makes the
+ index sparse by recording a directory entry in lieu of all the
+ files underneath that directory (thus making that a "skipped
+ directory" which unfortunately has also been called a "sparse
+ directory"), and does this for potentially multiple
+ directories. Controlled by the --[no-]sparse-index option to
+ init|set|reapply.
+
+sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
+ define the set of files of interest. A warning: It is easy to
+ over-use this term (or the shortened "patterns" term), for two
+ reasons: (1) users in cone mode specify directories rather than
+ patterns (their directories are transformed into patterns, but
+ users may think you are talking about non-cone mode if you use the
+ word "patterns"), and (b) the sparse specification might
+ transiently differ in the working tree or index from the sparsity
+ patterns (see "Sparse specification vs. sparsity patterns").
+
+sparse specification: The set of paths in the user's area of focus. This
+ is typically just the tracked files that match the sparsity
+ patterns, but the sparse specification can temporarily differ and
+ include additional files. (See also "Sparse specification
+ vs. sparsity patterns")
+
+ * When working with history, the sparse specification is exactly
+ the set of files matching the sparsity patterns.
+ * When interacting with the working tree, the sparse specification
+ is the set of tracked files with a clear SKIP_WORKTREE bit or
+ tracked files present in the working copy.
+ * When modifying or showing results from the index, the sparse
+ specification is the set of files with a clear SKIP_WORKTREE bit
+ or that differ in the index from HEAD.
+ * If working with the index and the working copy, the sparse
+ specification is the union of the paths from above.
+
+vivifying: When a command restores a tracked file to the working tree (and
+ hopefully also clears the SKIP_WORKTREE bit in the index for that
+ file), this is referred to as "vivifying" the file.
+
+
+=== Purpose of sparse-checkouts ===
+
+sparse-checkouts exist to allow users to work with a subset of their
+files.
+
+You can think of sparse-checkouts as subdividing "tracked" files into two
+categories -- a sparse subset, and all the rest. Implementationally, we
+mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
+out of the working tree. The SKIP_WORKTREE files are still tracked, just
+not present in the working tree.
+
+In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
+is missing from the working tree but pretend the file contents match HEAD".
+That was not only bogus (it actually meant the file missing from the
+working tree matched the index rather than HEAD), but it was also a
+low-level detail which only provided decent behavior for a few commands.
+There were a surprising number of ways in which that guiding principle gave
+command results that violated user expectations, and as such was a bad
+mental model. However, it persisted for many years and may still be found
+in some corners of the code base.
+
+Anyway, the idea of "working with a subset of files" is simple enough, but
+there are multiple different high-level usecases which affect how some Git
+subcommands should behave. Further, even if we only considered one of
+those usecases, sparse-checkouts can modify different subcommands in over a
+half dozen different ways. Let's start by considering the high level
+usecases:
+
+ A) Users are _only_ interested in the sparse portion of the repo
+
+ A*) Users are _only_ interested in the sparse portion of the repo
+ that they have downloaded so far
+
+ B) Users want a sparse working tree, but are working in a larger whole
+
+ C) sparse-checkout is a behind-the-scenes implementation detail allowing
+ Git to work with a specially crafted in-house virtual file system;
+ users are actually working with a "full" working tree that is
+ lazily populated, and sparse-checkout helps with the lazy population
+ piece.
+
+It may be worth explaining each of these in a bit more detail:
+
+
+ (Behavior A) Users are _only_ interested in the sparse portion of the repo
+
+These folks might know there are other things in the repository, but
+don't care. They are uninterested in other parts of the repository, and
+only want to know about changes within their area of interest. Showing
+them other files from history (e.g. from diff/log/grep/etc.) is a
+usability annoyance, potentially a huge one since other changes in
+history may dwarf the changes they are interested in.
+
+Some of these users also arrive at this usecase from wanting to use partial
+clones together with sparse checkouts (in a way where they have downloaded
+blobs within the sparse specification) and do disconnected development.
+Not only do these users generally not care about other parts of the
+repository, but consider it a blocker for Git commands to try to operate on
+those. If commands attempt to access paths in history outside the sparsity
+specification, then the partial clone will attempt to download additional
+blobs on demand, fail, and then fail the user's command. (This may be
+unavoidable in some cases, e.g. when `git merge` has non-trivial changes to
+reconcile outside the sparse specification, but we should limit how often
+users are forced to connect to the network.)
+
+Also, even for users using partial clones that do not mind being
+always connected to the network, the need to download blobs as
+side-effects of various other commands (such as the printed diffstat
+after a merge or pull) can lead to worries about local repository size
+growing unnecessarily[10].
+
+ (Behavior A*) Users are _only_ interested in the sparse portion of the repo
+ that they have downloaded so far (a variant on the first usecase)
+
+This variant is driven by folks who using partial clones together with
+sparse checkouts and do disconnected development (so far sounding like a
+subset of behavior A users) and doing so on very large repositories. The
+reason for yet another variant is that downloading even just the blobs
+through history within their sparse specification may be too much, so they
+only download some. They would still like operations to succeed without
+network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
+or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
+partial results that depend on what happens to have been downloaded.
+
+This variant could be viewed as Behavior A with the sparse specification
+for history querying operations modified from "sparsity patterns" to
+"sparsity patterns limited to the blobs we have already downloaded".
+
+ (Behavior B) Users want a sparse working tree, but are working in a
+ larger whole
+
+Stolee described this usecase this way[11]:
+
+"I'm also focused on users that know that they are a part of a larger
+whole. They know they are operating on a large repository but focus on
+what they need to contribute their part. I expect multiple "roles" to
+use very different, almost disjoint parts of the codebase. Some other
+"architect" users operate across the entire tree or hop between different
+sections of the codebase as necessary. In this situation, I'm wary of
+scoping too many features to the sparse-checkout definition, especially
+"git log," as it can be too confusing to have their view of the codebase
+depend on your "point of view."
+
+People might also end up wanting behavior B due to complex inter-project
+dependencies. The initial attempts to use sparse-checkouts usually involve
+the directories you are directly interested in plus what those directories
+depend upon within your repository. But there's a monkey wrench here: if
+you have integration tests, they invert the hierarchy: to run integration
+tests, you need not only what you are interested in and its in-tree
+dependencies, you also need everything that depends upon what you are
+interested in or that depends upon one of your dependencies...AND you need
+all the in-tree dependencies of that expanded group. That can easily
+change your sparse-checkout into a nearly dense one.
+
+Naturally, that tends to kill the benefits of sparse-checkouts. There are
+a couple solutions to this conundrum: either avoid grabbing in-repo
+dependencies (maybe have built versions of your in-repo dependencies pulled
+from a CI cache somewhere), or say that users shouldn't run integration
+tests directly and instead do it on the CI server when they submit a code
+review. Or do both. Regardless of whether you stub out your in-repo
+dependencies or stub out the things that depend upon you, there is
+certainly a reason to want to query and be aware of those other stubbed-out
+parts of the repository, particularly when the dependencies are complex or
+change relatively frequently. Thus, for such uses, sparse-checkouts can be
+used to limit what you directly build and modify, but these users do not
+necessarily want their sparse checkout paths to limit their queries of
+versions in history.
+
+Some people may also be interested in behavior B over behavior A simply as
+a performance workaround: if they are using non-cone mode, then they have
+to deal with its inherent quadratic performance problems. In that mode,
+every operation that checks whether paths match the sparsity specification
+can be expensive. As such, these users may only be willing to pay for
+those expensive checks when interacting with the working copy, and may
+prefer getting "unrelated" results from their history queries over having
+slow commands.
+
+ (Behavior C) sparse-checkout is an implementational detail supporting a
+ special VFS.
+
+This usecase goes slightly against the traditional definition of
+sparse-checkout in that it actually tries to present a full or dense
+checkout to the user. However, this usecase utilizes the same underlying
+technical underpinnings in a new way which does provide some performance
+advantages to users. The basic idea is that a company can have an in-house
+Git-aware Virtual File System which pretends all files are present in the
+working tree, by intercepting all file system accesses and using those to
+fetch and write accessed files on demand via partial clones. The VFS uses
+sparse-checkout to prevent Git from writing or paying attention to many
+files, and manually updates the sparse checkout patterns itself based on
+user access and modification of files in the working tree. See commit
+ecc7c8841d ("repo_read_index: add config to expect files outside sparse
+patterns", 2022-02-25) and the link at [17] for a more detailed description
+of such a VFS.
+
+The biggest difference here is that users are completely unaware that the
+sparse-checkout machinery is even in use. The sparse patterns are not
+specified by the user but rather are under the complete control of the VFS
+(and the patterns are updated frequently and dynamically by it). The user
+will perceive the checkout as dense, and commands should thus behave as if
+all files are present.
+
+
+=== Usecases of primary concern ===
+
+Most of the rest of this document will focus on Behavior A and Behavior
+B. Some notes about the other two cases and why we are not focusing on
+them:
+
+ (Behavior A*)
+
+Supporting this usecase is estimated to be difficult and a lot of work.
+There are no plans to implement it currently, but it may be a potential
+future alternative. Knowing about the existence of additional alternatives
+may affect our choice of command line flags (e.g. if we need tri-state or
+quad-state flags rather than just binary flags), so it was still important
+to at least note.
+
+Further, I believe the descriptions below for Behavior A are probably still
+valid for this usecase, with the only exception being that it redefines the
+sparse specification to restrict it to already-downloaded blobs. The hard
+part is in making commands capable of respecting that modified definition.
+
+ (Behavior C)
+
+This usecase violates some of the early sparse-checkout documented
+assumptions (since files marked as SKIP_WORKTREE will be displayed to users
+as present in the working tree). That violation may mean various
+sparse-checkout related behaviors are not well suited to this usecase and
+we may need tweaks -- to both documentation and code -- to handle it.
+However, this usecase is also perhaps the simplest model to support in that
+everything behaves like a dense checkout with a few exceptions (e.g. branch
+checkouts and switches write fewer things, knowing the VFS will lazily
+write the rest on an as-needed basis).
+
+Since there is no publically available VFS-related code for folks to try,
+the number of folks who can test such a usecase is limited.
+
+The primary reason to note the Behavior C usecase is that as we fix things
+to better support Behaviors A and B, there may be additional places where
+we need to make tweaks allowing folks in this usecase to get the original
+non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index:
+add config to expect files outside sparse patterns", 2022-02-25). The
+secondary reason to note Behavior C, is so that folks taking advantage of
+Behavior C do not assume they are part of the Behavior B camp and propose
+patches that break things for the real Behavior B folks.
+
+
+=== Oversimplified mental models ===
+
+An oversimplification of the differences in the above behaviors is:
+
+ Behavior A: Restrict worktree and history operations to sparse specification
+ Behavior B: Restrict worktree operations to sparse specification; have any
+ history operations work across all files
+ Behavior C: Do not restrict either worktree or history operations to the
+ sparse specification...with the exception of branch checkouts or
+ switches which avoid writing files that will match the index so
+ they can later lazily be populated instead.
+
+
+=== Desired behavior ===
+
+As noted previously, despite the simple idea of just working with a subset
+of files, there are a range of different behavioral changes that need to be
+made to different subcommands to work well with such a feature. See
+[1,2,3,4,5,6,7,8,9,10] for various examples. In particular, at [2], we saw
+that mere composition of other commands that individually worked correctly
+in a sparse-checkout context did not imply that the higher level command
+would work correctly; it sometimes requires further tweaks. So,
+understanding these differences can be beneficial.
+
+* Commands behaving the same regardless of high-level use-case
+
+ * commands that only look at files within the sparsity specification
+
+ * diff (without --cached or REVISION arguments)
+ * grep (without --cached or REVISION arguments)
+ * diff-files
+
+ * commands that restore files to the working tree that match sparsity
+ patterns, and remove unmodified files that don't match those
+ patterns:
+
+ * switch
+ * checkout (the switch-like half)
+ * read-tree
+ * reset --hard
+
+ * commands that write conflicted files to the working tree, but otherwise
+ will omit writing files to the working tree that do not match the
+ sparsity patterns:
+
+ * merge
+ * rebase
+ * cherry-pick
+ * revert
+
+ * `am` and `apply --cached` should probably be in this section but
+ are buggy (see the "Known bugs" section below)
+
+ The behavior for these commands somewhat depends upon the merge
+ strategy being used:
+ * `ort` behaves as described above
+ * `recursive` tries to not vivify files unnecessarily, but does sometimes
+ vivify files without conflicts.
+ * `octopus` and `resolve` will always vivify any file changed in the merge
+ relative to the first parent, which is rather suboptimal.
+
+ It is also important to note that these commands WILL update the index
+ outside the sparse specification relative to when the operation began,
+ BUT these commands often make a commit just before or after such that
+ by the end of the operation there is no change to the index outside the
+ sparse specification. Of course, if the operation hits conflicts or
+ does not make a commit, then these operations clearly can modify the
+ index outside the sparse specification.
+
+ Finally, it is important to note that at least the first four of these
+ commands also try to remove differences between the sparse
+ specification and the sparsity patterns (much like the commands in the
+ previous section).
+
+ * commands that always ignore sparsity since commits must be full-tree
+
+ * archive
+ * bundle
+ * commit
+ * format-patch
+ * fast-export
+ * fast-import
+ * commit-tree
+
+ * commands that write any modified file to the working tree (conflicted
+ or not, and whether those paths match sparsity patterns or not):
+
+ * stash
+ * apply (without `--index` or `--cached`)
+
+* Commands that may slightly differ for behavior A vs. behavior B:
+
+ Commands in this category behave mostly the same between the two
+ behaviors, but may differ in verbosity and types of warning and error
+ messages.
+
+ * commands that make modifications to which files are tracked:
+ * add
+ * rm
+ * mv
+ * update-index
+
+ The fact that files can move between the 'tracked' and 'untracked'
+ categories means some commands will have to treat untracked files
+ differently. But if we have to treat untracked files differently,
+ then additional commands may also need changes:
+
+ * status
+ * clean
+
+ In particular, `status` may need to report any untracked files outside
+ the sparsity specification as an erroneous condition (especially to
+ avoid the user trying to `git add` them, forcing `git add` to display
+ an error).
+
+ It's not clear to me exactly how (or even if) `clean` would change,
+ but it's the other command that also affects untracked files.
+
+ `update-index` may be slightly special. Its --[no-]skip-worktree flag
+ may need to ignore the sparse specification by its nature. Also, its
+ current --[no-]ignore-skip-worktree-entries default is totally bogus.
+
+ * commands for manually tweaking paths in both the index and the working tree
+ * `restore`
+ * the restore-like half of `checkout`
+
+ These commands should be similar to add/rm/mv in that they should
+ only operate on the sparse specification by default, and require a
+ special flag to operate on all files.
+
+ Also, note that these commands currently have a number of issues (see
+ the "Known bugs" section below)
+
+* Commands that significantly differ for behavior A vs. behavior B:
+
+ * commands that query history
+ * diff (with --cached or REVISION arguments)
+ * grep (with --cached or REVISION arguments)
+ * show (when given commit arguments)
+ * blame (only matters when one or more -C flags are passed)
+ * and annotate
+ * log
+ * whatchanged
+ * ls-files
+ * diff-index
+ * diff-tree
+ * ls-tree
+
+ Note: for log and whatchanged, revision walking logic is unaffected
+ but displaying of patches is affected by scoping the command to the
+ sparse-checkout. (The fact that revision walking is unaffected is
+ why rev-list, shortlog, show-branch, and bisect are not in this
+ list.)
+
+ ls-files may be slightly special in that e.g. `git ls-files -t` is
+ often used to see what is sparse and what is not. Perhaps -t should
+ always work on the full tree?
+
+* Commands I don't know how to classify
+
+ * range-diff
+
+ Is this like `log` or `format-patch`?
+
+ * cherry
+
+ See range-diff
+
+* Commands unaffected by sparse-checkouts
+
+ * shortlog
+ * show-branch
+ * rev-list
+ * bisect
+
+ * branch
+ * describe
+ * fetch
+ * gc
+ * init
+ * maintenance
+ * notes
+ * pull (merge & rebase have the necessary changes)
+ * push
+ * submodule
+ * tag
+
+ * config
+ * filter-branch (works in separate checkout without sparse-checkout setup)
+ * pack-refs
+ * prune
+ * remote
+ * repack
+ * replace
+
+ * bugreport
+ * count-objects
+ * fsck
+ * gitweb
+ * help
+ * instaweb
+ * merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
+ * rerere
+ * verify-commit
+ * verify-tag
+
+ * commit-graph
+ * hash-object
+ * index-pack
+ * mktag
+ * mktree
+ * multi-pack-index
+ * pack-objects
+ * prune-packed
+ * symbolic-ref
+ * unpack-objects
+ * update-ref
+ * write-tree (operates on index, possibly optimized to use sparse dir entries)
+
+ * for-each-ref
+ * get-tar-commit-id
+ * ls-remote
+ * merge-base (merges are computed full tree, so merge base should be too)
+ * name-rev
+ * pack-redundant
+ * rev-parse
+ * show-index
+ * show-ref
+ * unpack-file
+ * var
+ * verify-pack
+
+ * <Everything under 'Interacting with Others' in 'git help --all'>
+ * <Everything under 'Low-level...Syncing' in 'git help --all'>
+ * <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
+ * <Everything under 'External commands' in 'git help --all'>
+
+* Commands that might be affected, but who cares?
+
+ * merge-file
+ * merge-index
+ * gitk?
+
+
+=== Behavior classes ===
+
+From the above there are a few classes of behavior:
+
+ * "restrict"
+
+ Commands in this class only read or write files in the working tree
+ within the sparse specification.
+
+ When moving to a new commit (e.g. switch, reset --hard), these commands
+ may update index files outside the sparse specification as of the start
+ of the operation, but by the end of the operation those index files
+ will match HEAD again and thus those files will again be outside the
+ sparse specification.
+
+ When paths are explicitly specified, these paths are intersected with
+ the sparse specification and will only operate on such paths.
+ (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)
+
+ Some of these commands may also attempt, at the end of their operation,
+ to cull transient differences between the sparse specification and the
+ sparsity patterns (see "Sparse specification vs. sparsity patterns" for
+ details, but this basically means either removing unmodified files not
+ matching the sparsity patterns and marking those files as
+ SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
+ marking those files as !SKIP_WORKTREE).
+
+ * "restrict modulo conflicts"
+
+ Commands in this class generally behave like the "restrict" class,
+ except that:
+ (1) they will ignore the sparse specification and write files with
+ conflicts to the working tree (thus temporarily expanding the
+ sparse specification to include such files.)
+ (2) they are grouped with commands which move to a new commit, since
+ they often create a commit and then move to it, even though we
+ know there are many exceptions to moving to the new commit. (For
+ example, the user may rebase a commit that becomes empty, or have
+ a cherry-pick which conflicts, or a user could run `merge
+ --no-commit`, and we also view `apply --index` kind of like `am
+ --no-commit`.) As such, these commands can make changes to index
+ files outside the sparse specification, though they'll mark such
+ files with SKIP_WORKTREE.
+
+ * "restrict also specially applied to untracked files"
+
+ Commands in this class generally behave like the "restrict" class,
+ except that they have to handle untracked files differently too, often
+ because these commands are dealing with files changing state between
+ 'tracked' and 'untracked'. Often, this may mean printing an error
+ message if the command had nothing to do, but the arguments may have
+ referred to files whose tracked-ness state could have changed were it
+ not for the sparsity patterns excluding them.
+
+ * "no restrict"
+
+ Commands in this class ignore the sparse specification entirely.
+
+ * "restrict or no restrict dependent upon behavior A vs. behavior B"
+
+ Commands in this class behave like "no restrict" for folks in the
+ behavior B camp, and like "restrict" for folks in the behavior A camp.
+ However, when behaving like "restrict" a warning of some sort might be
+ provided that history queries have been limited by the sparse-checkout
+ specification.
+
+
+=== Subcommand-dependent defaults ===
+
+Note that we have different defaults depending on the command for the
+desired behavior :
+
+ * Commands defaulting to "restrict":
+ * diff-files
+ * diff (without --cached or REVISION arguments)
+ * grep (without --cached or REVISION arguments)
+ * switch
+ * checkout (the switch-like half)
+ * reset (<commit>)
+
+ * restore
+ * checkout (the restore-like half)
+ * checkout-index
+ * reset (with pathspec)
+
+ This behavior makes sense; these interact with the working tree.
+
+ * Commands defaulting to "restrict modulo conflicts":
+ * merge
+ * rebase
+ * cherry-pick
+ * revert
+
+ * am
+ * apply --index (which is kind of like an `am --no-commit`)
+
+ * read-tree (especially with -m or -u; is kind of like a --no-commit merge)
+ * reset (<tree-ish>, due to similarity to read-tree)
+
+ These also interact with the working tree, but require slightly
+ different behavior either so that (a) conflicts can be resolved or (b)
+ because they are kind of like a merge-without-commit operation.
+
+ (See also the "Known bugs" section below regarding `am` and `apply`)
+
+ * Commands defaulting to "no restrict":
+ * archive
+ * bundle
+ * commit
+ * format-patch
+ * fast-export
+ * fast-import
+ * commit-tree
+
+ * stash
+ * apply (without `--index`)
+
+ These have completely different defaults and perhaps deserve the most
+ detailed explanation:
+
+ In the case of commands in the first group (format-patch,
+ fast-export, bundle, archive, etc.), these are commands for
+ communicating history, which will be broken if they restrict to a
+ subset of the repository. As such, they operate on full paths and
+ have no `--restrict` option for overriding. Some of these commands may
+ take paths for manually restricting what is exported, but it needs to
+ be very explicit.
+
+ In the case of stash, it needs to vivify files to avoid losing the
+ user's changes.
+
+ In the case of apply without `--index`, that command needs to update
+ the working tree without the index (or the index without the working
+ tree if `--cached` is passed), and if we restrict those updates to the
+ sparse specification then we'll lose changes from the user.
+
+ * Commands defaulting to "restrict also specially applied to untracked files":
+ * add
+ * rm
+ * mv
+ * update-index
+ * status
+ * clean (?)
+
+ Our original implementation for the first three of these commands was
+ "no restrict", but it had some severe usability issues:
+ * `git add <somefile>` if honored and outside the sparse
+ specification, can result in the file randomly disappearing later
+ when some subsequent command is run (since various commands
+ automatically clean up unmodified files outside the sparse
+ specification).
+ * `git rm '*.jpg'` could very negatively surprise users if it deletes
+ files outside the range of the user's interest.
+ * `git mv` has similar surprises when moving into or out of the cone,
+ so best to restrict by default
+
+ So, we switched `add` and `rm` to default to "restrict", which made
+ usability problems much less severe and less frequent, but we still got
+ complaints because commands like:
+ git add <file-outside-sparse-specification>
+ git rm <file-outside-sparse-specification>
+ would silently do nothing. We should instead print an error in those
+ cases to get usability right.
+
+ update-index needs to be updated to match, and status and maybe clean
+ also need to be updated to specially handle untracked paths.
+
+ There may be a difference in here between behavior A and behavior B in
+ terms of verboseness of errors or additional warnings.
+
+ * Commands falling under "restrict or no restrict dependent upon behavior
+ A vs. behavior B"
+
+ * diff (with --cached or REVISION arguments)
+ * grep (with --cached or REVISION arguments)
+ * show (when given commit arguments)
+ * blame (only matters when one or more -C flags passed)
+ * and annotate
+ * log
+ * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
+ * ls-files
+ * diff-index
+ * diff-tree
+ * ls-tree
+
+ For now, we default to behavior B for these, which want a default of
+ "no restrict".
+
+ Note that two of these commands -- diff and grep -- also appeared in a
+ different list with a default of "restrict", but only when limited to
+ searching the working tree. The working tree vs. history distinction
+ is fundamental in how behavior B operates, so this is expected. Note,
+ though, that for diff and grep with --cached, when doing "restrict"
+ behavior, the difference between sparse specification and sparsity
+ patterns is important to handle.
+
+ "restrict" may make more sense as the long term default for these[12].
+ Also, supporting "restrict" for these commands might be a fair amount
+ of work to implement, meaning it might be implemented over multiple
+ releases. If that behavior were the default in the commands that
+ supported it, that would force behavior B users to need to learn to
+ slowly add additional flags to their commands, depending on git
+ version, to get the behavior they want. That gradual switchover would
+ be painful, so we should avoid it at least until it's fully
+ implemented.
+
+
+=== Sparse specification vs. sparsity patterns ===
+
+In a well-behaved situation, the sparse specification is given directly
+by the $GIT_DIR/info/sparse-checkout file. However, it can transiently
+diverge for a few reasons:
+
+ * needing to resolve conflicts (merging will vivify conflicted files)
+ * running Git commands that implicitly vivify files (e.g. "git stash apply")
+ * running Git commands that explicitly vivify files (e.g. "git checkout
+ --ignore-skip-worktree-bits FILENAME")
+ * other commands that write to these files (perhaps a user copies it
+ from elsewhere)
+
+For the last item, note that we do automatically clear the SKIP_WORKTREE
+bit for files that are present in the working tree. This has been true
+since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
+2022-03-09)
+
+However, such a situation is transient because:
+
+ * Such transient differences can and will be automatically removed as
+ a side-effect of commands which call unpack_trees() (checkout,
+ merge, reset, etc.).
+ * Users can also request such transient differences be corrected via
+ running `git sparse-checkout reapply`. Various places recommend
+ running that command.
+ * Additional commands are also welcome to implicitly fix these
+ differences; we may add more in the future.
+
+While we avoid dropping unstaged changes or files which have conflicts,
+we otherwise aggressively try to fix these transient differences. If
+users want these differences to persist, they should run the `set` or
+`add` subcommands of `git sparse-checkout` to reflect their intended
+sparse specification.
+
+However, when we need to do a query on history restricted to the
+"relevant subset of files" such a transiently expanded sparse
+specification is ignored. There are a couple reasons for this:
+
+ * The behavior wanted when doing something like
+ git grep expression REVISION
+ is roughly what the users would expect from
+ git checkout REVISION && git grep expression
+ (modulo a "REVISION:" prefix), which has a couple ramifications:
+
+ * REVISION may have paths not in the current index, so there is no
+ path we can consult for a SKIP_WORKTREE setting for those paths.
+
+ * Since `checkout` is one of those commands that tries to remove
+ transient differences in the sparse specification, it makes sense
+ to use the corrected sparse specification
+ (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
+ consult SKIP_WORKTREE anyway.
+
+So, a transiently expanded (or restricted) sparse specification applies to
+the working tree, but not to history queries where we always use the
+sparsity patterns. (See [16] for an early discussion of this.)
+
+Similar to a transiently expanded sparse specification of the working tree
+based on additional files being present in the working tree, we also need
+to consider additional files being modified in the index. In particular,
+if the user has staged changes to files (relative to HEAD) that do not
+match the sparsity patterns, and the file is not present in the working
+tree, we still want to consider the file part of the sparse specification
+if we are specifically performing a query related to the index (e.g. git
+diff --cached [REVISION], git diff-index [REVISION], git restore --staged
+--source=REVISION -- PATHS, etc.) Note that a transiently expanded sparse
+specification for the index usually only matters under behavior A, since
+under behavior B index operations are lumped with history and tend to
+operate full-tree.
+
+
+=== Implementation Questions ===
+
+ * Do the options --scope={sparse,all} sound good to others? Are there better
+ options?
+ * Names in use, or appearing in patches, or previously suggested:
+ * --sparse/--dense
+ * --ignore-skip-worktree-bits
+ * --ignore-skip-worktree-entries
+ * --ignore-sparsity
+ * --[no-]restrict-to-sparse-paths
+ * --full-tree/--sparse-tree
+ * --[no-]restrict
+ * --scope={sparse,all}
+ * --focus/--unfocus
+ * --limit/--unlimited
+ * Rationale making me lean slightly towards --scope={sparse,all}:
+ * We want a name that works for many commands, so we need a name that
+ does not conflict
+ * We know that we have more than two possible usecases, so it is best
+ to avoid a flag that appears to be binary.
+ * --scope={sparse,all} isn't overly long and seems relatively
+ explanatory
+ * `--sparse`, as used in add/rm/mv, is totally backwards for
+ grep/log/etc. Changing the meaning of `--sparse` for these
+ commands would fix the backwardness, but possibly break existing
+ scripts. Using a new name pairing would allow us to treat
+ `--sparse` in these commands as a deprecated alias.
+ * There is a different `--sparse`/`--dense` pair for commands using
+ revision machinery, so using that naming might cause confusion
+ * There is also a `--sparse` in both pack-objects and show-branch, which
+ don't conflict but do suggest that `--sparse` is overloaded
+ * The name --ignore-skip-worktree-bits is a double negative, is
+ quite a mouthful, refers to an implementation detail that many
+ users may not be familiar with, and we'd need a negation for it
+ which would probably be even more ridiculously long. (But we
+ can make --ignore-skip-worktree-bits a deprecated alias for
+ --no-restrict.)
+
+ * If a config option is added (sparse.scope?) what should the values and
+ description be? "sparse" (behavior A), "worktree-sparse-history-dense"
+ (behavior B), "dense" (behavior C)? There's a risk of confusion,
+ because even for Behaviors A and B we want some commands to be
+ full-tree and others to operate sparsely, so the wording may need to be
+ more tied to the usecases and somehow explain that. Also, right now,
+ the primary difference we are focusing is just the history-querying
+ commands (log/diff/grep). Previous config suggestion here: [13]
+
+ * Is `--no-expand` a good alias for ls-files's `--sparse` option?
+ (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
+ because in non-cone mode it does nothing and in cone-mode it shows the
+ sparse directory entries which are technically outside the sparse
+ specification)
+
+ * Under Behavior A:
+ * Does ls-files' `--no-expand` override the default `--scope=all`, or
+ does it need an extra flag?
+ * Does ls-files' `-t` option imply `--scope=all`?
+ * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?
+
+ * sparse-checkout: once behavior A is fully implemented, should we take
+ an interim measure to ease people into switching the default? Namely,
+ if folks are not already in a sparse checkout, then require
+ `sparse-checkout init/set` to take a
+ `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
+ would set sparse.scope according to the setting given), and throw an
+ error if the flag is not provided? That error would be a great place
+ to warn folks that the default may change in the future, and get them
+ used to specifying what they want so that the eventual default switch
+ is seamless for them.
+
+
+=== Implementation Goals/Plans ===
+
+ * Get buy-in on this document in general.
+
+ * Figure out answers to the 'Implementation Questions' sections (above)
+
+ * Fix bugs in the 'Known bugs' section (below)
+
+ * Provide some kind of method for backfilling the blobs within the sparse
+ specification in a partial clone
+
+ [Below here is kind of spitballing since the first two haven't been resolved]
+
+ * update-index: flip the default to --no-ignore-skip-worktree-entries,
+ nuke this stupid "Oh, there's a bug? Let me add a flag to let users
+ request that they not trigger this bug." flag
+
+ * Flags & Config
+ * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
+ * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
+ a deprecated aliases for `--scope=all`
+ * Create config option (sparse.scope?), tie it to the "Cliff notes"
+ overview
+
+ * Add --scope=sparse (and --scope=all) flag to each of the history querying
+ commands. IMPORTANT: make sure diff machinery changes don't mess with
+ format-patch, fast-export, etc.
+
+=== Known bugs ===
+
+This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
+been working on it.
+
+0. Behavior A is not well supported in Git. (Behavior B didn't used to
+ be either, but was the easier of the two to implement.)
+
+1. am and apply:
+
+ apply, without `--index` or `--cached`, relies on files being present
+ in the working copy, and also writes to them unconditionally. As
+ such, it should first check for the files' presence, and if found to
+ be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
+ its work. Currently, it just throws an error.
+
+ apply, with either `--cached` or `--index`, will not preserve the
+ SKIP_WORKTREE bit. This is fine if the file has conflicts, but
+ otherwise SKIP_WORKTREE bits should be preserved for --cached and
+ probably also for --index.
+
+ am, if there are no conflicts, will vivify files and fail to preserve
+ the SKIP_WORKTREE bit. If there are conflicts and `-3` is not
+ specified, it will vivify files and then complain the patch doesn't
+ apply. If there are conflicts and `-3` is specified, it will vivify
+ files and then complain that those vivified files would be
+ overwritten by merge.
+
+2. reset --hard:
+
+ reset --hard provides confusing error message (works correctly, but
+ misleads the user into believing it didn't):
+
+ $ touch addme
+ $ git add addme
+ $ git ls-files -t
+ H addme
+ H tracked
+ S tracked-but-maybe-skipped
+ $ git reset --hard # usually works great
+ error: Path 'addme' not uptodate; will not remove from working tree.
+ HEAD is now at bdbbb6f third
+ $ git ls-files -t
+ H tracked
+ S tracked-but-maybe-skipped
+ $ ls -1
+ tracked
+
+ `git reset --hard` DID remove addme from the index and the working tree, contrary
+ to the error message, but in line with how reset --hard should behave.
+
+3. read-tree
+
+ `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
+ entries it reads into the index, resulting in all your files suddenly
+ appearing to be "deleted".
+
+4. Checkout, restore:
+
+ These command do not handle path & revision arguments appropriately:
+
+ $ ls
+ tracked
+ $ git ls-files -t
+ H tracked
+ S tracked-but-maybe-skipped
+ $ git status --porcelain
+ $ git checkout -- '*skipped'
+ error: pathspec '*skipped' did not match any file(s) known to git
+ $ git ls-files -- '*skipped'
+ tracked-but-maybe-skipped
+ $ git checkout HEAD -- '*skipped'
+ error: pathspec '*skipped' did not match any file(s) known to git
+ $ git ls-tree HEAD | grep skipped
+ 100644 blob 276f5a64354b791b13840f02047738c77ad0584f tracked-but-maybe-skipped
+ $ git status --porcelain
+ $ git checkout HEAD~1 -- '*skipped'
+ $ git ls-files -t
+ H tracked
+ H tracked-but-maybe-skipped
+ $ git status --porcelain
+ M tracked-but-maybe-skipped
+ $ git checkout HEAD -- '*skipped'
+ $ git status --porcelain
+ $
+
+ Note that checkout without a revision (or restore --staged) fails to
+ find a file to restore from the index, even though ls-files shows
+ such a file certainly exists.
+
+ Similar issues occur with HEAD (--source=HEAD in restore's case),
+ but suddenly works when HEAD~1 is specified. And then after that it
+ will work with HEAD specified, even though it didn't before.
+
+ Directories are also an issue:
+
+ $ git sparse-checkout set nomatches
+ $ git status
+ On branch main
+ You are in a sparse checkout with 0% of tracked files present.
+
+ nothing to commit, working tree clean
+ $ git checkout .
+ error: pathspec '.' did not match any file(s) known to git
+ $ git checkout HEAD~1 .
+ Updated 1 path from 58916d9
+ $ git ls-files -t
+ S tracked
+ H tracked-but-maybe-skipped
+
+5. checkout and restore --staged, continued:
+
+ These commands do not correctly scope operations to the sparse
+ specification, and make it worse by not setting important SKIP_WORKTREE
+ bits:
+
+ $ git restore --source OLDREV --staged outside-sparse-cone/
+ $ git status --porcelain
+ MD outside-sparse-cone/file1
+ MD outside-sparse-cone/file2
+ MD outside-sparse-cone/file3
+
+ We can add a --scope=all mode to `git restore` to let it operate outside
+ the sparse specification, but then it will be important to set the
+ SKIP_WORKTREE bits appropriately.
+
+6. Performance issues; see:
+ https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
+
+
+=== Reference Emails ===
+
+Emails that detail various bugs we've had in sparse-checkout:
+
+[1] (Original descriptions of behavior A & behavior B)
+ https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
+[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
+ https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
+[3] (Present-despite-skipped entries)
+ https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
+[4] (Clone --no-checkout interaction)
+ https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout)
+[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
+ https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
+[6] (SKIP_WORKTREE is advisory, not mandatory)
+ https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
+[7] (`worktree add` should copy sparsity settings from current worktree)
+ https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
+[8] (Avoid negative surprises in add, rm, and mv)
+ https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
+ https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
+[9] (Move from out-of-cone to in-cone)
+ https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
+ https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
+[10] (Unnecessarily downloading objects outside sparse specification)
+ https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/
+
+[11] (Stolee's comments on high-level usecases)
+ https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
+
+[12] Others commenting on eventually switching default to behavior A:
+ * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
+ * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
+ * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
+
+[13] Previous config name suggestion and description
+ * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/
+
+[14] Tangential issue: switch to cone mode as default sparse specification mechanism:
+ https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/
+
+[15] Lengthy email on grep behavior, covering what should be searched:
+ * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/
+
+[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
+ search for the parenthetical comment starting "We do not check".
+ https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/
+
+[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/
diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
new file mode 100644
index 0000000..3b24c1a
--- /dev/null
+++ b/Documentation/technical/sparse-index.txt
@@ -0,0 +1,208 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git working directory are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone.
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
+At time of writing, sparse-directory entries violate expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In fact, these loops expect to see a reference to every
+staged file. One way to handle this is to parse trees to replace a
+sparse-directory entry with all of the files within that tree as the index
+is loaded. However, parsing trees is slower than parsing the index format,
+so that is a slower operation than if we left the index alone. The plan is
+to make all of these integrations "sparse aware" so this expansion through
+tree parsing is unnecessary and they use fewer resources than when using a
+full index.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will call a helper method,
+`ensure_full_index()`, which scans the index for sparse-directory entries
+(pointing to trees) and replaces them with the full list of paths (with
+blob contents) by parsing tree objects. This will be slower in all cases.
+The only noticeable change in behavior will be that the serialized index
+file contains sparse-directory entries.
+
+To start, we use a new required index extension, `sdir`, to allow
+inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, while allowing tools that do not
+understand the sparse-index to operate on repositories as long as they do
+not interact with the index. A new format, index v5, will be introduced
+that includes sparse-directory entries by default. It might also
+introduce other features that have been considered for improving the
+index, as well.
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. If a specific path is requested, then those will
+be protected from within the `index_file_exists()` and `index_name_pos()`
+API calls: they will call `ensure_full_index()` if necessary. The
+intention here is to preserve existing behavior when interacting with a
+sparse-checkout. We don't want a change to happen by accident, without
+tests. Many of these locations may not need any change before removing the
+guards, but we should not do so without tests to ensure the expected
+behavior happens.
+
+It may be desirable to _change_ the behavior of some commands in the
+presence of a sparse index or more generally in any sparse-checkout
+scenario. In such cases, these should be carefully communicated and
+tested. No such behavior changes are intended during this phase.
+
+During a scan of the codebase, not every iteration of the cache entries
+needs an `ensure_full_index()` check. The basic reasons include:
+
+1. The loop is scanning for entries with non-zero stage. These entries
+ are not collapsed into a sparse-directory entry.
+
+2. The loop is scanning for submodules. These entries are not collapsed
+ into a sparse-directory entry.
+
+3. The loop is part of the index API, especially around reading or
+ writing the format.
+
+4. The loop is checking for correct order of cache entries and that is
+ correct if and only if the sparse-directory entries are in the correct
+ location.
+
+5. The loop ignores entries with the `SKIP_WORKTREE` bit set, or is
+ otherwise already aware of sparse directory entries.
+
+6. The sparse-index is disabled at this point when using the split-index
+ feature, so no effort is made to protect the split-index API.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Hopefully, commands such as `git merge` and `git rebase` can benefit
+instead from merge algorithms that do not use the index as a data
+structure, such as the merge-ORT strategy. As these topics mature, we
+may enable the ORT strategy by default for repositories using the
+sparse-index feature.
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as some whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+much conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
diff --git a/Documentation/technical/unit-tests.txt b/Documentation/technical/unit-tests.txt
new file mode 100644
index 0000000..206037f
--- /dev/null
+++ b/Documentation/technical/unit-tests.txt
@@ -0,0 +1,240 @@
+= Unit Testing
+
+In our current testing environment, we spend a significant amount of effort
+crafting end-to-end tests for error conditions that could easily be captured by
+unit tests (or we simply forgo some hard-to-setup and rare error conditions).
+Unit tests additionally provide stability to the codebase and can simplify
+debugging through isolation. Writing unit tests in pure C, rather than with our
+current shell/test-tool helper setup, simplifies test setup, simplifies passing
+data around (no shell-isms required), and reduces testing runtime by not
+spawning a separate process for every test invocation.
+
+We believe that a large body of unit tests, living alongside the existing test
+suite, will improve code quality for the Git project.
+
+== Definitions
+
+For the purposes of this document, we'll use *test framework* to refer to
+projects that support writing test cases and running tests within the context
+of a single executable. *Test harness* will refer to projects that manage
+running multiple executables (each of which may contain multiple test cases) and
+aggregating their results.
+
+In reality, these terms are not strictly defined, and many of the projects
+discussed below contain features from both categories.
+
+For now, we will evaluate projects solely on their framework features. Since we
+are relying on having TAP output (see below), we can assume that any framework
+can be made to work with a harness that we can choose later.
+
+
+== Summary
+
+We believe the best way forward is to implement a custom TAP framework for the
+Git project. We use a version of the framework originally proposed in
+https://lore.kernel.org/git/c902a166-98ce-afba-93f2-ea6027557176@gmail.com/[1].
+
+See the <<framework-selection,Framework Selection>> section below for the
+rationale behind this decision.
+
+
+== Choosing a test harness
+
+During upstream discussion, it was occasionally noted that `prove` provides many
+convenient features, such as scheduling slower tests first, or re-running
+previously failed tests.
+
+While we already support the use of `prove` as a test harness for the shell
+tests, it is not strictly required. The t/Makefile allows running shell tests
+directly (though with interleaved output if parallelism is enabled). Git
+developers who wish to use `prove` as a more advanced harness can do so by
+setting DEFAULT_TEST_TARGET=prove in their config.mak.
+
+We will follow a similar approach for unit tests: by default the test
+executables will be run directly from the t/Makefile, but `prove` can be
+configured with DEFAULT_UNIT_TEST_TARGET=prove.
+
+
+[[framework-selection]]
+== Framework selection
+
+There are a variety of features we can use to rank the candidate frameworks, and
+those features have different priorities:
+
+* Critical features: we probably won't consider a framework without these
+** Can we legally / easily use the project?
+*** <<license,License>>
+*** <<vendorable-or-ubiquitous,Vendorable or ubiquitous>>
+*** <<maintainable-extensible,Maintainable / extensible>>
+*** <<major-platform-support,Major platform support>>
+** Does the project support our bare-minimum needs?
+*** <<tap-support,TAP support>>
+*** <<diagnostic-output,Diagnostic output>>
+*** <<runtime-skippable-tests,Runtime-skippable tests>>
+* Nice-to-have features:
+** <<parallel-execution,Parallel execution>>
+** <<mock-support,Mock support>>
+** <<signal-error-handling,Signal & error-handling>>
+* Tie-breaker stats
+** <<project-kloc,Project KLOC>>
+** <<adoption,Adoption>>
+
+[[license]]
+=== License
+
+We must be able to legally use the framework in connection with Git. As Git is
+licensed only under GPLv2, we must eliminate any LGPLv3, GPLv3, or Apache 2.0
+projects.
+
+[[vendorable-or-ubiquitous]]
+=== Vendorable or ubiquitous
+
+We want to avoid forcing Git developers to install new tools just to run unit
+tests. Any prospective frameworks and harnesses must either be vendorable
+(meaning, we can copy their source directly into Git's repository), or so
+ubiquitous that it is reasonable to expect that most developers will have the
+tools installed already.
+
+[[maintainable-extensible]]
+=== Maintainable / extensible
+
+It is unlikely that any pre-existing project perfectly fits our needs, so any
+project we select will need to be actively maintained and open to accepting
+changes. Alternatively, assuming we are vendoring the source into our repo, it
+must be simple enough that Git developers can feel comfortable making changes as
+needed to our version.
+
+In the comparison table below, "True" means that the framework seems to have
+active developers, that it is simple enough that Git developers can make changes
+to it, and that the project seems open to accepting external contributions (or
+that it is vendorable). "Partial" means that at least one of the above
+conditions holds.
+
+[[major-platform-support]]
+=== Major platform support
+
+At a bare minimum, unit-testing must work on Linux, MacOS, and Windows.
+
+In the comparison table below, "True" means that it works on all three major
+platforms with no issues. "Partial" means that there may be annoyances on one or
+more platforms, but it is still usable in principle.
+
+[[tap-support]]
+=== TAP support
+
+The https://testanything.org/[Test Anything Protocol] is a text-based interface
+that allows tests to communicate with a test harness. It is already used by
+Git's integration test suite. Supporting TAP output is a mandatory feature for
+any prospective test framework.
+
+In the comparison table below, "True" means this is natively supported.
+"Partial" means TAP output must be generated by post-processing the native
+output.
+
+Frameworks that do not have at least Partial support will not be evaluated
+further.
+
+[[diagnostic-output]]
+=== Diagnostic output
+
+When a test case fails, the framework must generate enough diagnostic output to
+help developers find the appropriate test case in source code in order to debug
+the failure.
+
+[[runtime-skippable-tests]]
+=== Runtime-skippable tests
+
+Test authors may wish to skip certain test cases based on runtime circumstances,
+so the framework should support this.
+
+[[parallel-execution]]
+=== Parallel execution
+
+Ideally, we will build up a significant collection of unit test cases, most
+likely split across multiple executables. It will be necessary to run these
+tests in parallel to enable fast develop-test-debug cycles.
+
+In the comparison table below, "True" means that individual test cases within a
+single test executable can be run in parallel. We assume that executable-level
+parallelism can be handled by the test harness.
+
+[[mock-support]]
+=== Mock support
+
+Unit test authors may wish to test code that interacts with objects that may be
+inconvenient to handle in a test (e.g. interacting with a network service).
+Mocking allows test authors to provide a fake implementation of these objects
+for more convenient tests.
+
+[[signal-error-handling]]
+=== Signal & error handling
+
+The test framework should fail gracefully when test cases are themselves buggy
+or when they are interrupted by signals during runtime.
+
+[[project-kloc]]
+=== Project KLOC
+
+The size of the project, in thousands of lines of code as measured by
+https://dwheeler.com/sloccount/[sloccount] (rounded up to the next multiple of
+1,000). As a tie-breaker, we probably prefer a project with fewer LOC.
+
+[[adoption]]
+=== Adoption
+
+As a tie-breaker, we prefer a more widely-used project. We use the number of
+GitHub / GitLab stars to estimate this.
+
+
+=== Comparison
+
+:true: [lime-background]#True#
+:false: [red-background]#False#
+:partial: [yellow-background]#Partial#
+
+:gpl: [lime-background]#GPL v2#
+:isc: [lime-background]#ISC#
+:mit: [lime-background]#MIT#
+:expat: [lime-background]#Expat#
+:lgpl: [lime-background]#LGPL v2.1#
+
+:custom-impl: https://lore.kernel.org/git/c902a166-98ce-afba-93f2-ea6027557176@gmail.com/[Custom Git impl.]
+:greatest: https://github.com/silentbicycle/greatest[Greatest]
+:criterion: https://github.com/Snaipe/Criterion[Criterion]
+:c-tap: https://github.com/rra/c-tap-harness/[C TAP]
+:check: https://libcheck.github.io/check/[Check]
+
+[format="csv",options="header",width="33%",subs="specialcharacters,attributes,quotes,macros"]
+|=====
+Framework,"<<license,License>>","<<vendorable-or-ubiquitous,Vendorable or ubiquitous>>","<<maintainable-extensible,Maintainable / extensible>>","<<major-platform-support,Major platform support>>","<<tap-support,TAP support>>","<<diagnostic-output,Diagnostic output>>","<<runtime--skippable-tests,Runtime- skippable tests>>","<<parallel-execution,Parallel execution>>","<<mock-support,Mock support>>","<<signal-error-handling,Signal & error handling>>","<<project-kloc,Project KLOC>>","<<adoption,Adoption>>"
+{custom-impl},{gpl},{true},{true},{true},{true},{true},{true},{false},{false},{false},1,0
+{greatest},{isc},{true},{partial},{true},{partial},{true},{true},{false},{false},{false},3,1400
+{criterion},{mit},{false},{partial},{true},{true},{true},{true},{true},{false},{true},19,1800
+{c-tap},{expat},{true},{partial},{partial},{true},{false},{true},{false},{false},{false},4,33
+{check},{lgpl},{false},{partial},{true},{true},{true},{false},{false},{false},{true},17,973
+|=====
+
+=== Additional framework candidates
+
+Several suggested frameworks have been eliminated from consideration:
+
+* Incompatible licenses:
+** https://github.com/zorgnax/libtap[libtap] (LGPL v3)
+** https://cmocka.org/[cmocka] (Apache 2.0)
+* Missing source: https://www.kindahl.net/mytap/doc/index.html[MyTap]
+* No TAP support:
+** https://nemequ.github.io/munit/[µnit]
+** https://github.com/google/cmockery[cmockery]
+** https://github.com/lpabon/cmockery2[cmockery2]
+** https://github.com/ThrowTheSwitch/Unity[Unity]
+** https://github.com/siu/minunit[minunit]
+** https://cunit.sourceforge.net/[CUnit]
+
+
+== Milestones
+
+* Add useful tests of library-like code
+* Integrate with
+ https://lore.kernel.org/git/20230502211454.1673000-1-calvinwan@google.com/[stdlib
+ work]
+* Run alongside regular `make test` target