author: Carlo Marcelo Arenas Belón <carenas@gmail.com> 2023-01-08 15:52:17 (GMT)
committer: Junio C Hamano <gitster@pobox.com> 2023-01-18 23:24:52 (GMT)
commit: acabd2048ee0ee53728100408970ab45a6dab65e
tree: 52f0571afd786b179b89b71a42c675758b7993e6
parent: c48035d29b4e524aed3a32f0403676f0d9128863
grep: correctly identify utf-8 characters with \{b,w} in -P
When UTF is enabled for a PCRE match, the corresponding flags are
added to the pcre2_compile() call, but PCRE2_UCP wasn't included.

This prevents extending the meaning of the character classes to
include those new valid characters and therefore results in failed
matches for expressions that rely on that extension, for example:

  $ git grep -P '\bÆvar'

Add PCRE2_UCP so that \w will include Æ and therefore \b can
correctly match the beginning of that word.

This has an impact on performance that has been estimated to be
between 20% and 40%, as shown by the added performance test.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Diffstat (limited to 'grep.c')
-rw-r--r-- grep.c 2
1 file changed, 1 insertion, 1 deletion
diff --git a/grep.c b/grep.c
index 06eed69..1687f65 100644
--- a/grep.c
+++ b/grep.c
@@ -293,7 +293,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		options |= PCRE2_CASELESS;
 	}
 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
-		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */