utf8: handle systems that don't write BOM for UTF-16

When serializing UTF-16 (and UTF-32), there are three possible ways to write the stream. One can write the data with a BOM in either big-endian or little-endian format, or one can write the data without a BOM in big-endian format.

Most systems' iconv implementations choose to write it with a BOM in some endianness, since this is the most foolproof, and it is resistant to misinterpretation on Windows, where UTF-16 and the little-endian serialization are very common. For compatibility with Windows and to avoid accidental misuse there, Git always wants to write UTF-16 with a BOM, and will refuse to read UTF-16 without it.

However, musl's iconv implementation writes UTF-16 without a BOM, relying on the user to interpret it as big-endian. This causes t0028 and the related functionality to fail, since Git won't read the file without a BOM.

Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the iconv implementation has this behavior. When set, Git will write a BOM manually for UTF-16 and UTF-32 and then force the data to be written in UTF-16BE or UTF-32BE. We choose big-endian behavior here because the tests use the raw "UTF-16" encoding, which will be big-endian when the implementation requires this knob to be set.

Update the tests to detect this case and write test data with an added BOM if necessary. Always write the BOM in the tests in big-endian format, since all iconv implementations that omit a BOM must use big-endian serialization according to the Unicode standard.

Preserve the existing behavior for systems which do not have this knob enabled, since they may use optimized implementations, including defaulting to the native endianness, which may improve performance.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
author: brian m. carlson <sandals@crustytoothpaste.net> 2019-02-12 00:52:06 (GMT)
committer: Junio C Hamano <gitster@pobox.com> 2019-02-12 02:20:07 (GMT)
commit: 79444c92943048f9ac62e9311038ebe43f5f0982 (patch)
tree: a83eddab899e28c0823b5a879d70bc1e0782650a /utf8.c
parent: 11ad41d4cb25a41b21a7343c495aceb38d4db4d8 (diff)
1 files changed, 14 insertions, 0 deletions
diff --git a/utf8.c b/utf8.c
index 83824dc..3b42fad 100644
--- a/utf8.c
+++ b/utf8.c
@@ -559,6 +559,10 @@ char *reencode_string_len(const char *in, size_t insz,
 	/*
 	 * For writing, UTF-16 iconv typically creates "UTF-16BE-BOM"
 	 * Some users under Windows want the little endian version
+	 *
+	 * We handle UTF-16 and UTF-32 ourselves only if the platform does not
+	 * provide a BOM (which we require), since we want to match the behavior
+	 * of the system tools and libc as much as possible.
 	 */
 	if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) {
 		bom_str = utf16_le_bom;
@@ -568,6 +572,16 @@ char *reencode_string_len(const char *in, size_t insz,
 		bom_str = utf16_be_bom;
 		bom_len = sizeof(utf16_be_bom);
 		out_encoding = "UTF-16BE";
+#ifdef ICONV_OMITS_BOM
+	} else if (same_utf_encoding("UTF-16", out_encoding)) {
+		bom_str = utf16_be_bom;
+		bom_len = sizeof(utf16_be_bom);
+		out_encoding = "UTF-16BE";
+	} else if (same_utf_encoding("UTF-32", out_encoding)) {
+		bom_str = utf32_be_bom;
+		bom_len = sizeof(utf32_be_bom);
+		out_encoding = "UTF-32BE";
+#endif
 	}
 
 	conv = iconv_open(out_encoding, in_encoding);
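
The selection logic added under ICONV_OMITS_BOM can be illustrated outside of Git with a small standalone sketch. It is not the code from utf8.c: the names choose_bom, emit_bom, and struct bom_choice are hypothetical, and the comparison uses a plain strcmp() where Git uses same_utf_encoding(), which is more lenient about spelling. The sketch only shows the idea from the hunk above: for a bare "UTF-16" or "UTF-32" request, pick the big-endian BOM and hand iconv the explicit big-endian encoding name, so that the BOM written by hand and the bytes produced by iconv agree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-order marks for the big-endian serializations. */
static const unsigned char utf16_be_bom[] = { 0xFE, 0xFF };
static const unsigned char utf32_be_bom[] = { 0x00, 0x00, 0xFE, 0xFF };

/* Hypothetical helper type: what to prepend and what to ask iconv for. */
struct bom_choice {
	const unsigned char *bom;  /* NULL when no manual BOM is needed */
	size_t bom_len;            /* 0 when no manual BOM is needed */
	const char *iconv_name;    /* encoding name to pass to iconv_open() */
};

/*
 * Assumes an iconv that omits the BOM for bare "UTF-16"/"UTF-32".
 * Simplification: exact string match only.
 */
struct bom_choice choose_bom(const char *requested)
{
	struct bom_choice c = { NULL, 0, requested };

	if (!strcmp(requested, "UTF-16")) {
		c.bom = utf16_be_bom;
		c.bom_len = sizeof(utf16_be_bom);
		c.iconv_name = "UTF-16BE";
	} else if (!strcmp(requested, "UTF-32")) {
		c.bom = utf32_be_bom;
		c.bom_len = sizeof(utf32_be_bom);
		c.iconv_name = "UTF-32BE";
	}
	return c;
}

/*
 * Copy the chosen BOM to the start of the output buffer.
 * Returns the number of bytes written; 0 means either that no BOM
 * was needed or that the buffer was too small to hold it.
 */
size_t emit_bom(const struct bom_choice *c, unsigned char *out, size_t outsz)
{
	assert(c != NULL && out != NULL);
	if (c->bom_len == 0 || c->bom_len > outsz)
		return 0;
	memcpy(out, c->bom, c->bom_len);
	return c->bom_len;
}
```

A caller would run choose_bom() on the requested output encoding, write the BOM with emit_bom(), and then append whatever iconv produces for iconv_name. Encodings other than the two bare names pass through unchanged, which mirrors how the patch leaves other encodings alone.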