Is UTF-16 compatible with UTF-8?


Problem Description

I asked Google the question above and was sent to "Difference between UTF-8 and UTF-16?", which unfortunately doesn't answer the question.

From my understanding UTF-8 should be a subset of UTF-16 meaning: if my code uses UTF-16 and I hand in a UTF-8 encoded string everything should always be fine. The other way around (expecting UTF-8 and getting UTF-16) may cause problems.

Is that correct?

EDIT: To clarify why the linked SO question doesn't answer my question: My problem arose when trying to process a JSON string using WebClient.DownloadString, because the WebClient used the wrong encoding. The JSON I received from the request was encoded as UTF-8, and the question for me was: if I set webClient.Encoding = New System.Text.UnicodeEncoding (a.k.a. UTF-16), would I be on the safe side, i.e. able to handle UTF-8 and UTF-16 request results, or should I use webClient.Encoding = New System.Text.UTF8Encoding?
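For reference, this is roughly what the two candidate settings look like in code, as a minimal VB.NET sketch (the URL is only a placeholder for the actual request):

Imports System.Net
Imports System.Text

Module DownloadSketch
    Sub Main()
        Dim url As String = "https://example.com/data.json"   ' placeholder endpoint
        Using wc As New WebClient()
            ' Candidate 1: tell WebClient the response bytes are UTF-16.
            ' wc.Encoding = New UnicodeEncoding()
            ' Candidate 2: tell WebClient the response bytes are UTF-8,
            ' which is what the server in question actually sends.
            wc.Encoding = New UTF8Encoding()
            Dim json As String = wc.DownloadString(url)
            Console.WriteLine(json)
        End Using
    End Sub
End Module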

Solution

It's not clear what you mean by "compatible", so let's get some basics out of the way.

Unicode is the underlying concept, and properly implemented, UTF-16 and UTF-8 are two different ways to encode Unicode. They are obviously different -- otherwise, why would there be two different concepts?

Unicode by itself does not specify a serialization format. UTF-8 and UTF-16 are two alternative serialization formats.

They are "compatible" in the sense that they can represent the same Unicode code points, but "incompatible" in that the representations are completely different.

There are two additional twists with UTF-16. There are actually two different encodings, UTF-16LE and UTF-16BE. These differ in endianness. (UTF-8 is a byte encoding, so does not have endianness.) Legacy UTF-16 used to be restricted to 65,536 possible characters, which is less than Unicode currently contains. This is handled with surrogates, but really old and/or broken UTF-16 implementations (properly identified as UCS-2, not "real" UTF-16) do not support them.
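To make the surrogate mechanism concrete, here is the standard UTF-16 surrogate-pair arithmetic for a code point above U+FFFF, shown as a small VB.NET sketch (U+1F4A9 is the same character that appears in the table below):

Module SurrogateSketch
    Sub Main()
        ' Standard UTF-16 decomposition of a code point above U+FFFF into a surrogate pair.
        Dim codePoint As Integer = &H1F4A9
        Dim v As Integer = codePoint - &H10000        ' 20-bit remainder: 0xF4A9
        Dim high As Integer = &HD800 + (v >> 10)      ' high surrogate: 0xD83D
        Dim low As Integer = &HDC00 + (v And &H3FF)   ' low surrogate:  0xDCA9
        Console.WriteLine("U+{0:X}: high surrogate 0x{1:X4}, low surrogate 0x{2:X4}", codePoint, high, low)
    End Sub
End Module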

For a bit of concretion, let's compare four different code points. We pick U+0041, U+00E5, U+201C, and U+1F4A9, as they illustrate the differences nicely.

U+0041 is a 7-bit character, so UTF-8 represents it simply with a single byte. U+00E5 is an 8-bit character, so UTF-8 needs to encode it. U+1F4A9 is outside the Basic Multilingual Plane, so UTF-16 represents it with a surrogate sequence. Finally, U+201C is none of the above.

Here are the representations of our candidate characters in UTF-8, UTF-16LE, and UTF-16BE.

Character | UTF-8               | UTF-16LE            | UTF-16BE            |
----------+---------------------+---------------------+---------------------+
U+0041    | 0x41                | 0x41 0x00           | 0x00 0x41           |
U+00E5    | 0xC3 0xA5           | 0xE5 0x00           | 0x00 0xE5           |
U+201C    | 0xE2 0x80 0x9C      | 0x1C 0x20           | 0x20 0x1C           |
U+1F4A9   | 0xF0 0x9F 0x92 0xA9 | 0x3D 0xD8 0xA9 0xDC | 0xD8 0x3D 0xDC 0xA9 |
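If you want to verify these byte sequences yourself, a small VB.NET sketch will do; in System.Text, Encoding.Unicode is UTF-16LE and Encoding.BigEndianUnicode is UTF-16BE:

Imports System.Text

Module TableSketch
    Sub Main()
        ' The four sample code points; Char.ConvertFromUtf32 builds the surrogate pair for U+1F4A9.
        Dim samples As String() = {Char.ConvertFromUtf32(&H41), Char.ConvertFromUtf32(&HE5), Char.ConvertFromUtf32(&H201C), Char.ConvertFromUtf32(&H1F4A9)}
        For Each s As String In samples
            Dim cp As Integer = Char.ConvertToUtf32(s, 0)
            Dim utf8 As String = BitConverter.ToString(Encoding.UTF8.GetBytes(s))
            Dim utf16le As String = BitConverter.ToString(Encoding.Unicode.GetBytes(s))
            Dim utf16be As String = BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s))
            Console.WriteLine("U+{0:X4}  UTF-8: {1}  UTF-16LE: {2}  UTF-16BE: {3}", cp, utf8, utf16le, utf16be)
        Next
    End Sub
End Module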

To pick one obvious example, the UTF-8 encoding of U+00E5 would represent a completely different character if interpreted as UTF-16 (in UTF-16LE, it would be U+A5C3, and in UTF-16BE, U+C3A5.) Conversely, many of the UTF-16 codes are not valid UTF-8 sequences at all. So in this sense, UTF-8 and UTF-16 are completely and utterly incompatible.
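That mismatch is easy to reproduce with the same System.Text classes; a short sketch:

Imports System.Text

Module MisreadSketch
    Sub Main()
        ' "å" (U+00E5) encoded as UTF-8 gives the bytes 0xC3 0xA5 ...
        Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes("å")
        ' ... which, read back as UTF-16LE, come out as the single code point U+A5C3.
        Dim misread As String = Encoding.Unicode.GetString(utf8Bytes)
        Console.WriteLine("U+{0:X4}", Char.ConvertToUtf32(misread, 0))   ' prints U+A5C3
    End Sub
End Module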

These are byte values; in ASCII, 0x00 is the NUL character (sometimes represented as ^@), 0x41 is uppercase A, and 0xE5 is undefined; in e.g. Latin-1 it represents the character å (which is also conveniently U+00E5 in Unicode), but in KOI8-R it is the Cyrillic character Е (U+0415), etc.

In modern programming languages, your code should simply use Unicode, and let the language handle the nitty-gritty of encoding it in a way which is suitable for your platform and libraries. On a somewhat tangential note, see also http://utf8everywhere.org/
