Why isn't `Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x)) == x`?

Problem description

In .NET, why isn't it true that:

Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x))

returns the original byte array for an arbitrary byte array x?

This is mentioned in an answer to another question, but the responder doesn't explain why.

Recommended answer

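Character encodings (UTF-8, specifically) may have different forms for the same code point.

So when you convert to a string and back, the actual bytes may represent a different (canonical) form.

Note also that an arbitrary byte array is not necessarily valid UTF-8. With the default Encoding.UTF8, GetString replaces each invalid sequence with U+FFFD (the replacement character), so GetBytes cannot reproduce the original bytes.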
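To make the round trip concrete, here is a minimal sketch. It assumes the default Encoding.UTF8 instance (which replaces invalid input rather than throwing); the class and method names (RoundTripDemo, RoundTrip) are made up for illustration. In this sketch a well-formed byte sequence survives the round trip, while a byte that can never appear in UTF-8 does not.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
    // Decode the bytes as UTF-8, then encode the resulting string again.
    public static byte[] RoundTrip(byte[] x) =>
        Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x));

    public static void Main()
    {
        // "a" + U+0306 + U+0301 as well-formed UTF-8.
        byte[] valid = { 0x61, 0xCC, 0x86, 0xCC, 0x81 };
        // 0xFF never occurs in well-formed UTF-8.
        byte[] invalid = { 0xFF };

        Console.WriteLine(RoundTrip(valid).SequenceEqual(valid));     // True
        Console.WriteLine(RoundTrip(invalid).SequenceEqual(invalid)); // False

        // The invalid byte was decoded to U+FFFD, which encodes as EF BF BD.
        Console.WriteLine(BitConverter.ToString(RoundTrip(invalid))); // EF-BF-BD
    }
}
```

If the bytes are not text at all (hashes, keys, compressed data), a byte-preserving representation such as Convert.ToBase64String is usually a safer choice than decoding them as UTF-8.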

See also String.Normalize(NormalizationForm.FormD).

See also:

  • Can I get a single canonical UTF-8 string from a Unicode string?
  • What does .NET's String.Normalize do?
  • NormalizationForm

Some Unicode sequences are considered equivalent because they represent the same character. For example, the following are considered equivalent because any of these can be used to represent "ắ":

"\u1EAF" 
"\u0103\u0301" 
"\u0061\u0306\u0301" 

However, ordinal (that is, binary) comparisons consider these sequences different because they contain different Unicode code values. Before performing ordinal comparisons, applications must normalize these strings to decompose them into their basic components.
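The equivalence described above can be checked with a short sketch. It uses only String.Normalize, string.Equals with StringComparison.Ordinal, and Encoding.UTF8.GetByteCount; the class and helper names (NormalizationDemo, AllEqualAfterNormalize) are hypothetical, and the sketch assumes a runtime where Unicode normalization is available.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // True when all strings are ordinal-equal after normalizing each to `form`.
    public static bool AllEqualAfterNormalize(NormalizationForm form, params string[] values)
    {
        string first = values[0].Normalize(form);
        foreach (string v in values)
        {
            if (!string.Equals(first, v.Normalize(form), StringComparison.Ordinal))
                return false;
        }
        return true;
    }

    public static void Main()
    {
        string a = "\u1EAF";             // precomposed
        string b = "\u0103\u0301";       // a-breve + combining acute
        string c = "\u0061\u0306\u0301"; // a + combining breve + combining acute

        // Ordinal comparison sees three different code unit sequences.
        Console.WriteLine(string.Equals(a, b, StringComparison.Ordinal)); // False
        Console.WriteLine(string.Equals(b, c, StringComparison.Ordinal)); // False

        // Their UTF-8 encodings differ in length as well.
        Console.WriteLine(Encoding.UTF8.GetByteCount(a)); // 3
        Console.WriteLine(Encoding.UTF8.GetByteCount(b)); // 4
        Console.WriteLine(Encoding.UTF8.GetByteCount(c)); // 5

        // After normalization to a common form they compare equal.
        Console.WriteLine(AllEqualAfterNormalize(NormalizationForm.FormD, a, b, c)); // True
        Console.WriteLine(AllEqualAfterNormalize(NormalizationForm.FormC, a, b, c)); // True
    }
}
```

Normalization here is an explicit step applied to the strings; the sketch does not rely on the encoder or decoder doing it.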

That page comes with a nice sample that shows which encodings are always normalized.
