Do a better job at guessing unknown character sets

At least in the kernel development community, we're generally slowly converting to UTF-8 everywhere, and the old default of Latin1 in emails is being supplanted by UTF-8, and it doesn't necessarily show up as such in the mail headers (because, quite frankly, when people send patches around, they want the email client to do as little as humanly possible about the patch).

Despite that, it's often the case that email addresses etc still have Latin1, so I've seen emails where this is a mixed bag, with Signed-off parts being copied from email (and containing Latin1 characters), and the rest of the email being a patch in UTF-8.

So this suggests a very natural change: if the target character set is utf-8 (the default), and if the source already looks like utf-8, just assume that it doesn't need any conversion at all.

Only assume that it needs conversion if it isn't already valid utf-8, in which case we (for historical reasons) will assume it's Latin1.

Basically no really _valid_ latin1 will ever look like utf-8, so while this changes our historical behaviour, it doesn't do so in practice, and makes the default behaviour saner for the case where the input was already in proper format.

We could do a more fancy guess, of course, but this correctly handled a series of patches I just got from Andrew that had a mixture of Latin1 and UTF-8 (in different emails, but without any character set indication).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
author: Linus Torvalds <torvalds@linux-foundation.org> 2007-07-17 17:34:44 (GMT)
committer: Junio C Hamano <gitster@pobox.com> 2007-07-19 00:01:10 (GMT)
commit: b59d398beab604e577846ef8393735478c1ca3c2 (patch)
tree: 1c21e06640a6601b776ee8f2d207a76a7a31fc04 /builtin-mailinfo.c
parent: ec96e0f6a4244e3bccc745eeb4cb6daa80a347e4 (diff)
download: git-b59d398beab604e577846ef8393735478c1ca3c2.zip
git-b59d398beab604e577846ef8393735478c1ca3c2.tar.gz
git-b59d398beab604e577846ef8393735478c1ca3c2.tar.bz2
1 files changed, 29 insertions, 4 deletions
diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 489c2c5..a37a4ff 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -499,15 +499,40 @@ static int decode_b_segment(char *in, char *ot, char *ep)
 	return 0;
 }
 
+/*
+ * When there is no known charset, guess.
+ *
+ * Right now we assume that if the target is UTF-8 (the default),
+ * and it already looks like UTF-8 (which includes US-ASCII as its
+ * subset, of course) then that is what it is and there is nothing
+ * to do.
+ *
+ * Otherwise, we default to assuming it is Latin1 for historical
+ * reasons.
+ */
+static const char *guess_charset(const char *line, const char *target_charset)
+{
+	if (is_encoding_utf8(target_charset)) {
+		if (is_utf8(line))
+			return NULL;
+	}
+	return "latin1";
+}
+
 static void convert_to_utf8(char *line, const char *charset)
 {
-	static const char latin_one[] = "latin1";
-	const char *input_charset = *charset ? charset : latin_one;
-	char *out = reencode_string(line, metainfo_charset, input_charset);
+	char *out;
+
+	if (!charset || !*charset) {
+		charset = guess_charset(line, metainfo_charset);
+		if (!charset)
+			return;
+	}
 
+	out = reencode_string(line, metainfo_charset, charset);
 	if (!out)
 		die("cannot convert from %s to %s\n",
-		    input_charset, metainfo_charset);
+		    charset, metainfo_charset);
 	strcpy(line, out);
 	free(out);
 }
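
The guessing rule in the patch above can be tried outside of git with a small standalone sketch. This is not the code from builtin-mailinfo.c: the patch relies on git's own is_utf8() and is_encoding_utf8() helpers, whereas sketch_is_utf8(), sketch_is_encoding_utf8() and sketch_guess_charset() below are hypothetical stand-ins written for illustration, and the validator is only assumed to be roughly as strict as git's.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of two charset names (illustrative helper). */
static int sketch_same_name(const char *a, const char *b)
{
	for (; *a && *b; a++, b++)
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
	return *a == *b;	/* true only if both strings ended together */
}

/* Hypothetical stand-in for is_encoding_utf8(): NULL is treated as the default, UTF-8. */
static int sketch_is_encoding_utf8(const char *name)
{
	if (!name)
		return 1;
	return sketch_same_name(name, "utf-8") || sketch_same_name(name, "utf8");
}

/*
 * Hypothetical stand-in for is_utf8(): returns 1 if the NUL-terminated
 * text is well-formed UTF-8 (pure US-ASCII included), 0 otherwise.
 */
static int sketch_is_utf8(const char *text)
{
	static const unsigned int min_cp[] = { 0, 0x80, 0x800, 0x10000 };
	const unsigned char *s = (const unsigned char *)text;

	while (*s) {
		unsigned int cp;
		int i, extra;

		if (*s < 0x80) {		/* US-ASCII byte */
			s++;
			continue;
		} else if ((*s & 0xE0) == 0xC0) {
			extra = 1; cp = *s & 0x1F;
		} else if ((*s & 0xF0) == 0xE0) {
			extra = 2; cp = *s & 0x0F;
		} else if ((*s & 0xF8) == 0xF0) {
			extra = 3; cp = *s & 0x07;
		} else {
			return 0;		/* stray continuation or invalid lead byte */
		}
		s++;
		for (i = 0; i < extra; i++, s++) {
			if ((*s & 0xC0) != 0x80)	/* also stops at the NUL terminator */
				return 0;
			cp = (cp << 6) | (*s & 0x3F);
		}
		if (cp < min_cp[extra])		/* overlong encoding */
			return 0;
		if (cp >= 0xD800 && cp <= 0xDFFF)	/* UTF-16 surrogate range */
			return 0;
		if (cp > 0x10FFFF)		/* beyond the Unicode range */
			return 0;
	}
	return 1;
}

/*
 * Same decision as the patch: NULL means "already UTF-8, nothing to
 * convert"; otherwise fall back to Latin1 for historical reasons.
 */
static const char *sketch_guess_charset(const char *line, const char *target_charset)
{
	if (sketch_is_encoding_utf8(target_charset)) {
		if (sketch_is_utf8(line))
			return NULL;
	}
	return "latin1";
}
```

With these stand-ins, a line holding "Bj\xc3\xb6rn" (UTF-8) is left alone when the target is UTF-8, while "Bj\xf6rn" (Latin1) is reported as "latin1" and would be re-encoded. As in the patch, any non-UTF-8 target always yields "latin1", even for pure ASCII input.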