path: root/utf8.h
AgeCommit message (Collapse)Author
2019-05-05*.[ch]: remove extern from function declarations using spatchDenton Liu
There has been a push to remove extern from function declarations. Remove some instances of "extern" for function declarations which are caught by Coccinelle. Note that Coccinelle has some difficulty with processing functions with `__attribute__` or varargs so some `extern` declarations are left behind to be dealt with in a future patch. This was the Coccinelle patch used: @@ type T; identifier f; @@ - extern T f(...); and it was run with: $ git ls-files \*.{c,h} | grep -v ^compat/ | xargs spatch --sp-file contrib/coccinelle/noextern.cocci --in-place Files under `compat/` are intentionally excluded as some are directly copied from external sources and we should avoid churning them as much as possible. Signed-off-by: Denton Liu <> Signed-off-by: Junio C Hamano <>
2019-01-31Support working-tree-encoding "UTF-16LE-BOM"Torsten Bögershausen
Users who want UTF-16 files in the working tree set the .gitattributes like this: test.txt working-tree-encoding=UTF-16 The unicode standard itself defines 3 allowed ways how to encode UTF-16. The following 3 versions convert all back to 'g' 'i' 't' in UTF-8: a) UTF-16, without BOM, big endian: $ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c 0000000 g i t b) UTF-16, with BOM, little endian: $ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c 0000000 g i t c) UTF-16, with BOM, big endian: $ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c 0000000 g i t Git uses libiconv to convert from UTF-8 in the index into ITF-16 in the working tree. After a checkout, the resulting file has a BOM and is encoded in "UTF-16", in the version (c) above. This is what iconv generates, more details follow below. iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE: d) UTF-16 $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c 0000000 376 377 \0 g \0 i \0 t e) UTF-16LE $ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c 0000000 g \0 i \0 t \0 f) UTF-16BE $ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c 0000000 \0 g \0 i \0 t There is no way to generate version (b) from above in a Git working tree, but that is what some applications need. (All fully unicode aware applications should be able to read all 3 variants, but in practise we are not there yet). When producing UTF-16 as an output, iconv generates the big endian version with a BOM. (big endian is probably chosen for historical reasons). iconv can produce UTF-16 files with little endianess by using "UTF-16LE" as encoding, and that file does not have a BOM. Not all users (especially under Windows) are happy with this. Some tools are not fully unicode aware and can only handle version (b). Today there is no way to produce version (b) with iconv (or libiconv). Looking into the history of iconv, it seems as if version (c) will be used in all future iconv versions (for compatibility reasons). Solve this dilemma and introduce a Git-specific "UTF-16LE-BOM". libiconv can not handle the encoding, so Git pick it up, handles the BOM and uses libiconv to convert the rest of the stream. (UTF-16BE-BOM is added for consistency) Rported-by: Adrián Gimeno Balaguer <> Signed-off-by: Torsten Bögershausen <> Signed-off-by: Junio C Hamano <>
2018-08-20Merge branch 'en/incl-forward-decl'Junio C Hamano
Code hygiene improvement for the header files. * en/incl-forward-decl: Remove forward declaration of an enum compat/precompose_utf8.h: use more common include guard style urlmatch.h: fix include guard Move definition of enum branch_track from cache.h to branch.h alloc: make allocate_alloc_state and clear_alloc_state more consistent Add missing includes and forward declarations
2018-08-15Add missing includes and forward declarationsElijah Newren
I looped over the toplevel header files, creating a temporary two-line C program for each consisting of #include "git-compat-util.h" #include $HEADER This patch is the result of manually fixing errors in compiling those tiny programs. Signed-off-by: Elijah Newren <> Signed-off-by: Junio C Hamano <>
2018-07-24reencode_string: use size_t for string lengthsJeff King
The iconv interface takes a size_t, which is the appropriate type for an in-memory buffer. But our reencode_string_* functions use integers, meaning we may get confusing results when the sizes exceed INT_MAX. Let's use size_t consistently. Signed-off-by: Jeff King <> Signed-off-by: Junio C Hamano <>
2018-05-29Sync with Git 2.17.1Junio C Hamano
* maint: (25 commits) Git 2.17.1 Git 2.16.4 Git 2.15.2 Git 2.14.4 Git 2.13.7 fsck: complain when .gitmodules is a symlink index-pack: check .gitmodules files with --strict unpack-objects: call fsck_finish() after fscking objects fsck: call fsck_finish() after fscking objects fsck: check .gitmodules content fsck: handle promisor objects in .gitmodules check fsck: detect gitmodules files fsck: actually fsck blob data fsck: simplify ".git" check index-pack: make fsck error message more specific verify_path: disallow symlinks in .gitmodules update-index: stat updated files earlier verify_dotfile: mention case-insensitivity in comment verify_path: drop clever fallthrough skip_prefix: add case-insensitive variant ...
2018-05-22is_hfs_dotgit: match other .git filesJeff King
Both verify_path() and fsck match ".git", ".GIT", and other variants specific to HFS+. Let's allow matching other special files like ".gitmodules", which we'll later use to enforce extra restrictions via verify_path() and fsck. Signed-off-by: Jeff King <>
2018-04-16utf8: add function to detect a missing UTF-16/32 BOMLars Schneider
If the endianness is not defined in the encoding name, then let's be strict and require a BOM to avoid any encoding confusion. The is_missing_required_utf_bom() function returns true if a required BOM is missing. The Unicode standard instructs to assume big-endian if there in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used in HTML5 recommends to assume little-endian to "deal with deployed content" [3]. Strictly requiring a BOM seems to be the safest option for content in Git. This function is used in a subsequent commit. [1] [2] Section 3.10, D98, page 132 [3] Signed-off-by: Lars Schneider <> Signed-off-by: Junio C Hamano <>
2018-04-16utf8: add function to detect prohibited UTF-16/32 BOMLars Schneider
Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used [1]. The function returns true if this is the case. This function is used in a subsequent commit. [1] Signed-off-by: Lars Schneider <> Signed-off-by: Junio C Hamano <>
2016-05-06typofix: assorted typofixes in comments, documentation and messagesLi Peng
Many instances of duplicate words (e.g. "the the path") and a few typoes are fixed, originally in multiple patches. wildmatch: fix duplicate words of "the" t: fix duplicate words of "output" transport-helper: fix duplicate words of "read" fix duplicate words of "return" path: fix duplicate words of "look" pack-protocol.txt: fix duplicate words of "the" precompose-utf8: fix typo of "sequences" split-index: fix typo worktree.c: fix typo remote-ext: fix typo utf8: fix duplicate words of "the" git-cvsserver: fix duplicate words Signed-off-by: Li Peng <> Signed-off-by: Junio C Hamano <>
2015-09-17utf8: add function to align a string into given strbufKarthik Nayak
Add strbuf_utf8_align() which will align a given string into a strbuf as per given align_type and width. If the width is greater than the string length then no alignment is performed. Helped-by: Eric Sunshine <> Mentored-by: Christian Couder <> Mentored-by: Matthieu Moy <> Signed-off-by: Karthik Nayak <> Signed-off-by: Junio C Hamano <>
2015-06-24Merge branch 'es/utf8-stupid-compiler-workaround'Junio C Hamano
A compilation workaround. * es/utf8-stupid-compiler-workaround: utf8: NO_ICONV: silence uninitialized variable warning
2015-06-05utf8: NO_ICONV: silence uninitialized variable warningEric Sunshine
The last argument of reencode_string_len() is an 'int *' which is assigned the length of the converted string. When NO_ICONV is defined, however, reencode_string_len() is stubbed out by the macro: #define reencode_string_len(a,b,c,d,e) NULL which never assigns a value to the final argument. When called like this: int n; char *s = reencode_string_len(..., &n); if (s) do_something(s, n); some compilers complain that 'n' is used uninitialized within the conditional. Signed-off-by: Eric Sunshine <> Reviewed-by: Jeff King <> Signed-off-by: Junio C Hamano <>
2015-04-16utf8-bom: introduce skip_utf8_bom() helperJunio C Hamano
With the recent change to ignore the UTF8 BOM at the beginning of .gitignore files, we now have two codepaths that do such a skipping (the other one is for reading the configuration files). Introduce utf8_bom[] constant string and skip_utf8_bom() helper and teach .gitignore code how to use it. Signed-off-by: Junio C Hamano <>
2014-12-17utf8: add is_hfs_dotgit() helperJeff King
We do not allow paths with a ".git" component to be added to the index, as that would mean repository contents could overwrite our repository files. However, asking "is this path the same as .git" is not as simple as strcmp() on some filesystems. HFS+'s case-folding does more than just fold uppercase into lowercase (which we already handle with strcasecmp). It may also skip past certain "ignored" Unicode code points, so that (for example) ".gi\u200ct" is mapped ot ".git". The full list of folds can be found in the tables at: Implementing a full "is this path the same as that path" comparison would require us importing the whole set of tables. However, what we want to do is much simpler: we only care about checking ".git". We know that 'G' is the only thing that folds to 'g', and so on, so we really only need to deal with the set of ignored code points, which is much smaller. Signed-off-by: Jeff King <> Signed-off-by: Junio C Hamano <>
2013-07-10add missing "format" function attributesJeff King
For most of our functions that take printf-like formats, we use gcc's __attribute__((format)) to get compiler warnings when the functions are misused. Let's give a few more functions the same protection. In most cases, the annotations do not uncover any actual bugs; the only code change needed is that we passed a size_t to transfer_debug, which expected an int. Since we expect the passed-in value to be a relatively small buffer size (and cast a similar value to int directly below), we can just cast away the problem. Signed-off-by: Jeff King <> Signed-off-by: Junio C Hamano <>
2013-04-18pretty: support %>> that steal trailing spacesNguyễn Thái Ngọc Duy
This is pretty useful in `%<(100)%s%Cred%>(20)% an' where %s does not use up all 100 columns and %an needs more than 20 columns. By replacing %>(20) with %>>(20), %an can steal spaces from %s. %>> understands escape sequences, so %Cred does not stop it from stealing spaces in %<(100). Signed-off-by: Nguyễn Thái Ngọc Duy <> Signed-off-by: Junio C Hamano <>
2013-04-18pretty: support truncating in %>, %< and %><Nguyễn Thái Ngọc Duy
%>(N,trunc) truncates the right part after N columns and replace the last two letters with "..". ltrunc does the same on the left. mtrunc cuts the middle out. Signed-off-by: Nguyễn Thái Ngọc Duy <> Signed-off-by: Junio C Hamano <>
2013-04-18utf8.c: add reencode_string_len() that can handle NULs in stringNguyễn Thái Ngọc Duy
Signed-off-by: Nguyễn Thái Ngọc Duy <> Signed-off-by: Junio C Hamano <>
2013-04-18utf8.c: add utf8_strnwidth() with the ability to skip ansi sequencesNguyễn Thái Ngọc Duy
Signed-off-by: Nguyễn Thái Ngọc Duy <> Signed-off-by: Junio C Hamano <>
2013-03-25Merge branch 'ks/rfc2047-one-char-at-a-time'Junio C Hamano
When "format-patch" quoted a non-ascii strings on the header files, it incorrectly applied rfc2047 and chopped a single character in the middle of it. * ks/rfc2047-one-char-at-a-time: format-patch: RFC 2047 says multi-octet character may not be split
2013-03-09format-patch: RFC 2047 says multi-octet character may not be splitKirill Smelkov
Even though an earlier attempt (bafc478..41dd00bad) cleaned up RFC 2047 encoding, pretty.c::add_rfc2047() still decides where to split the output line by going through the input one byte at a time, and potentially splits a character in the middle. A subject line may end up showing like this: ".... fö?? bar". (instead of ".... föö bar".) if split incorrectly. RFC 2047, section 5 (3) explicitly forbids such beaviour Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded- word's. that means that e.g. for Subject: .... föö bar encoding Subject: =?UTF-8?q?....=20f=C3=B6=C3=B6?= =?UTF-8?q?=20bar?= is correct, and Subject: =?UTF-8?q?....=20f=C3=B6=C3?= <-- NOTE ö is broken here =?UTF-8?q?=B6=20bar?= is not, because "ö" character UTF-8 encoding C3 B6 is split here across adjacent encoded words. To fix the problem, make the loop grab one _character_ at a time and determine its output length to see where to break the output line. Note that this version only knows about UTF-8, but the logic to grab one character is abstracted out in mbs_chrlen() function to make it possible to extend it to other encodings with the help of iconv in the future. Signed-off-by: Kirill Smelkov <> Signed-off-by: Junio C Hamano <>
2013-02-14Merge branch 'jx/utf8-printf-width'Junio C Hamano
Use a new helper that prints a message and counts its display width to align the help messages parse-options produces. * jx/utf8-printf-width: Add utf8_fprintf helper that returns correct number of columns
2013-02-11Add utf8_fprintf helper that returns correct number of columnsJiang Xin
Since command usages can be translated, they may include utf-8 encoded strings, and the output in console may not align well any more. This is because strlen() is different from strwidth() on utf-8 strings. A wrapper utf8_fprintf() can help to return the correct number of columns required. Signed-off-by: Jiang Xin <> Signed-off-by: Nguyễn Thái Ngọc Duy <> Reviewed-by: Torsten Bögershausen <> Signed-off-by: Junio C Hamano <>
2013-01-02Merge branch 'sp/shortlog-missing-lf'Junio C Hamano
When a line to be wrapped has a solid run of non space characters whose length exactly is the wrap width, "git shortlog -w" failed to add a newline after such a line. * sp/shortlog-missing-lf: strbuf_add_wrapped*(): Remove unused return value shortlog: fix wrapping lines of wraplen
2012-12-11strbuf_add_wrapped*(): Remove unused return valueSteffen Prohaska
Since shortlog isn't using the return value anymore (see previous commit), the functions can be changed to void. Signed-off-by: Steffen Prohaska <> Signed-off-by: Junio C Hamano <>
2012-11-04reencode_string(): introduce and use same_encoding()Junio C Hamano
Callers of reencode_string() that re-encodes a string from one encoding to another all used ad-hoc way to bypass the case where the input and the output encodings are the same. Some did strcmp(), some did strcasecmp(), yet some others when converting to UTF-8 used is_encoding_utf8(). Introduce same_encoding() helper function to make these callers use the same logic. Notably, is_encoding_utf8() has a work-around for common misconfiguration to use "utf8" to name UTF-8 encoding, which does not match "UTF-8" hence strcasecmp() would not consider the same. Make use of it in this helper function. Signed-off-by: Junio C Hamano <>
2012-07-09git on Mac OS and precomposed unicodeTorsten Bögershausen
Mac OS X mangles file names containing unicode on file systems HFS+, VFAT or SAMBA. When a file using unicode code points outside ASCII is created on a HFS+ drive, the file name is converted into decomposed unicode and written to disk. No conversion is done if the file name is already decomposed unicode. Calling open("\xc3\x84", ...) with a precomposed "Ä" yields the same result as open("\x41\xcc\x88",...) with a decomposed "Ä". As a consequence, readdir() returns the file names in decomposed unicode, even if the user expects precomposed unicode. Unlike on HFS+, Mac OS X stores files on a VFAT drive (e.g. an USB drive) in precomposed unicode, but readdir() still returns file names in decomposed unicode. When a git repository is stored on a network share using SAMBA, file names are send over the wire and written to disk on the remote system in precomposed unicode, but Mac OS X readdir() returns decomposed unicode to be compatible with its behaviour on HFS+ and VFAT. The unicode decomposition causes many problems: - The names "git add" and other commands get from the end user may often be precomposed form (the decomposed form is not easily input from the keyboard), but when the commands read from the filesystem to see what it is going to update the index with already is on the filesystem, readdir() will give decomposed form, which is different. - Similarly "git log", "git mv" and all other commands that need to compare pathnames found on the command line (often but not always precomposed form; a command line input resulting from globbing may be in decomposed) with pathnames found in the tree objects (should be precomposed form to be compatible with other systems and for consistency in general). - The same for names stored in the index, which should be precomposed, that may need to be compared with the names read from readdir(). NFS mounted from Linux is fully transparent and does not suffer from the above. As Mac OS X treats precomposed and decomposed file names as equal, we can - wrap readdir() on Mac OS X to return the precomposed form, and - normalize decomposed form given from the command line also to the precomposed form, to ensure that all pathnames used in Git are always in the precomposed form. This behaviour can be requested by setting "core.precomposedunicode" configuration variable to true. The code in compat/precomposed_utf8.c implements basically 4 new functions: precomposed_utf8_opendir(), precomposed_utf8_readdir(), precomposed_utf8_closedir() and precompose_argv(). The first three are to wrap opendir(3), readdir(3), and closedir(3) functions. The argv[] conversion allows to use the TAB filename completion done by the shell on command line. It tolerates other tools which use readdir() to feed decomposed file names into git. When creating a new git repository with "git init" or "git clone", "core.precomposedunicode" will be set "false". The user needs to activate this feature manually. She typically sets core.precomposedunicode to "true" on HFS and VFAT, or file systems mounted via SAMBA. Helped-by: Junio C Hamano <> Signed-off-by: Torsten Bögershausen <> Signed-off-by: Junio C Hamano <>
2011-02-23strbuf: add fixed-length version of add_wrapped_textJeff King
The function strbuf_add_wrapped_text takes a NUL-terminated string. This makes it annoying to wrap strings we have as a pointer and a length. Refactoring strbuf_add_wrapped_text and all of its sub-functions to handle fixed-length strings turned out to be really ugly. So this implementation is lame; it just strdups the text and operates on the NUL-terminated version. This should be fine as the strings we are wrapping are generally pretty short. If it becomes a problem, we can optimize later. Signed-off-by: Jeff King <> Signed-off-by: Junio C Hamano <>
2010-03-02Merge branch 'rs/optim-text-wrap'Junio C Hamano
* rs/optim-text-wrap: utf8.c: speculatively assume utf-8 in strbuf_add_wrapped_text() utf8.c: remove strbuf_write() utf8.c: remove print_spaces() utf8.c: remove print_wrapped_text()
2010-02-20utf8.c: remove print_wrapped_text()René Scharfe
strbuf_add_wrapped_text() is called only from print_wrapped_text() without a strbuf (in which case it writes its results to stdout). At its only callsite, supply a strbuf, call strbuf_add_wrapped_text() directly and remove the wrapper function. Signed-off-by: Rene Scharfe <> Signed-off-by: Junio C Hamano <>
2010-01-12utf8.c: mark file-local function staticJunio C Hamano
Signed-off-by: Junio C Hamano <>
2009-10-19Add strbuf_add_wrapped_text() to utf8.[ch]Johannes Schindelin
The newly added function can rewrap text according to a given first-line indent, other-indent and text width. Signed-off-by: Johannes Schindelin <>
2009-02-05utf8: add utf8_strwidth()Geoffrey Thomas
I'm about to use this pattern more than once, so make it a common function. Signed-off-by: Geoffrey Thomas <> Signed-off-by: Junio C Hamano <>
2008-01-07utf8_width(): allow non NUL-terminated inputJunio C Hamano
The original interface assumed that the input string is always terminated with a NUL, but that wasn't too useful. Signed-off-by: Junio C Hamano <>
2008-01-07utf8: pick_one_utf8_char()Junio C Hamano
utf8_width() function was doing two different things. To pick a valid character from UTF-8 stream, and compute the display width of that character. This splits the former to a separate function pick_one_utf8_char(). Signed-off-by: Junio C Hamano <>
2007-02-28Actually make print_wrapped_text() usefulJohannes Schindelin
Now, it returns the current column, does not add a newline, and you can pass a negative indent, to indicate that the indent was already printed. With this, you can actually continue in the middle of a paragraph, not having to print everything into a buffer first. Signed-off-by: Johannes Schindelin <> Signed-off-by: Junio C Hamano <>
2006-12-30commit-tree: cope with different ways "utf-8" can be spelled.Junio C Hamano
People can spell config.commitencoding differently from what we internally have ("utf-8") to mean UTF-8. Try to accept them and treat them equally. Signed-off-by: Junio C Hamano <>
2006-12-26Move encoding conversion routine out of mailinfo to utf8.cJunio C Hamano
This moves the body of convert_to_utf8() routine used in mailinfo to the utf8.c i18n library. Signed-off-by: Junio C Hamano <>
2006-12-24commit-tree: encourage UTF-8 commit messages.Johannes Schindelin
Introduce is_utf() to check if a text looks like it is encoded in UTF-8, utf8_width() to count display width, and implements print_wrapped_text() using them. git-commit-tree warns if the commit message does not minimally conform to the UTF-8 encoding when i18n.commitencoding is either unset, or set to "utf-8". Signed-off-by: Junio C Hamano <>