author René Scharfe <l.s.r@web.de> 2023-04-06 20:19:11 (GMT)
committer Junio C Hamano <gitster@pobox.com> 2023-04-07 14:38:09 (GMT)
commit be391449542d8a67cb343ec9d0b2f6854d665354
tree 733f2c434ff3ef6d95c35c744f8ccd80d577b37e /userdiff.h
parent 768bb238c4843bf52847773a621de4dffa6b9ab5
userdiff: support regexec(3) with multi-byte support
Since 1819ad327b (grep: fix multibyte regex handling under macOS, 2022-08-26) we use the system library for all regular expression matching on macOS, not just for git grep. It supports multi-byte strings and rejects invalid multi-byte characters.

This broke all built-in userdiff word regexes in UTF-8 locales because they all include such invalid bytes in expressions that are intended to match multi-byte characters without explicit support for that from the regex engine.

"|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word regexes to match a single non-space or multi-byte character. The \xNN characters are invalid if interpreted as UTF-8 because they have their high bit set, which indicates they are part of a multi-byte character, but they are surrounded by single-byte characters.

Replace that expression with "|[^[:space:]]" if the regex engine supports multi-byte matching, as there is no need to have an explicit range for multi-byte characters then. Check for that capability at runtime, because it depends on the locale and thus on environment variables. Construct the full replacement expression at build time and just switch it in if necessary to avoid string manipulation and allocations at runtime.

Additionally the word regex for tex contains the expression "[a-zA-Z0-9\x80-\xff]+" with a similarly invalid range. The best replacement with only valid characters that I can come up with is "([a-zA-Z0-9]|[^\x01-\x7f])+". Unlike the original it matches NUL characters, though. Assuming that tex files usually don't contain NUL this should be acceptable.

Reported-by: D. Ben Knoble <ben.knoble@gmail.com>
Reported-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
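The runtime capability check mentioned in the message can be sketched roughly as follows. This is a minimal illustration and not necessarily the code the patch adds to userdiff.c: the function name regexec_handles_multi_byte is hypothetical, and the probe assumes the caller has already set up the locale (for example with setlocale(LC_CTYPE, "")). It compiles "[^[:space:]]" and asks whether a single match covers both bytes of a two-byte UTF-8 character.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/*
 * Hypothetical probe: returns 1 if regexec(3) treats a two-byte UTF-8
 * sequence as one character under the current locale, 0 otherwise.
 */
static int regexec_handles_multi_byte(void)
{
	static const char not_space[] = "[^[:space:]]";
	static const char two_bytes[] = "\xc4\xb3"; /* U+0133 encoded as UTF-8 */
	regex_t re;
	regmatch_t m;
	int err, ok;

	err = regcomp(&re, not_space, REG_EXTENDED);
	assert(!err); /* a constant, valid pattern is expected to compile */
	if (err)
		return 0; /* with NDEBUG: treat failure as "not supported" */

	/* Multi-byte aware engines match the whole character, not one byte. */
	ok = !regexec(&re, two_bytes, 1, &m, 0) &&
	     m.rm_so == 0 &&
	     (size_t)m.rm_eo == strlen(two_bytes);
	regfree(&re);
	return ok;
}
```

Because the answer depends on the locale, a caller that wants to avoid repeated regcomp(3) calls could cache the result after the first probe.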
Diffstat (limited to 'userdiff.h')
-rw-r--r-- userdiff.h 1
1 files changed, 1 insertions, 0 deletions
diff --git a/userdiff.h b/userdiff.h
index aee91bc..b09974f 100644
--- a/userdiff.h
+++ b/userdiff.h
@@ -17,6 +17,7 @@ struct userdiff_driver {
 	int binary;
 	struct userdiff_funcname funcname;
 	const char *word_regex;
+	const char *word_regex_multi_byte;
 	const char *textconv;
 	struct notes_cache *textconv_cache;
 	int textconv_want_cache;
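
How the new field might be used can be sketched as below. This is an illustration under stated assumptions, not the code from userdiff.c: struct word_regex_pair is a reduced stand-in for struct userdiff_driver, and WORD_REGEX_PAIR and pick_word_regex are hypothetical names. The macro builds both variants at build time through string literal concatenation, so switching between them at runtime needs no allocation. The fallback for a missing multi-byte variant is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct userdiff_driver: only the two regex fields. */
struct word_regex_pair {
	const char *word_regex;
	const char *word_regex_multi_byte;
};

/*
 * Hypothetical helper: append the byte-range suffix for engines without
 * multi-byte support and the plain suffix for engines that have it.
 */
#define WORD_REGEX_PAIR(base) { \
	base "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
	base "|[^[:space:]]" }

/*
 * Hypothetical selector: prefer the multi-byte variant when the engine
 * supports it and the driver provides one, otherwise fall back.
 */
static const char *pick_word_regex(const struct word_regex_pair *d,
				   int engine_is_multi_byte)
{
	assert(d != NULL);
	if (engine_is_multi_byte && d->word_regex_multi_byte)
		return d->word_regex_multi_byte;
	return d->word_regex;
}
```

A caller would pass the result of its locale-dependent capability check as engine_is_multi_byte and hand the returned string to regcomp(3).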