Why does .net use the UTF16 encoding for string, but uses UTF-8 as default for saving files?


Problem Description


From here

Essentially, string uses the UTF-16 character encoding form

But when saving via StreamWriter:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM),
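For illustration, a minimal sketch of the two cases being compared (the file names are just placeholders for this example):

```csharp
using System.IO;
using System.Text;

class StreamWriterDefaults
{
    static void Main()
    {
        // Default StreamWriter constructor: UTF-8, no byte-order mark.
        using (var utf8Writer = new StreamWriter("utf8-default.txt"))
        {
            utf8Writer.Write("Hello, 世界");
        }

        // Explicit encoding: UTF-16 LE (Encoding.Unicode), written with its BOM.
        using (var utf16Writer = new StreamWriter("utf16.txt", false, Encoding.Unicode))
        {
            utf16Writer.Write("Hello, 世界");
        }
    }
}
```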

I've seen this sample (broken link removed):

And it looks like UTF-8 is smaller for some strings while UTF-16 is smaller for others.
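For instance, a quick comparison along those lines (the sample strings here are arbitrary ones I picked):

```csharp
using System;
using System.Text;

class SizeComparison
{
    static void Main()
    {
        string ascii = "Hello, world";   // ASCII-only text
        string chinese = "你好，世界";    // BMP characters outside ASCII

        // ASCII: 1 byte per character in UTF-8 vs 2 in UTF-16.
        Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));      // 12
        Console.WriteLine(Encoding.Unicode.GetByteCount(ascii));   // 24

        // CJK text: 3 bytes per character in UTF-8 vs 2 in UTF-16.
        Console.WriteLine(Encoding.UTF8.GetByteCount(chinese));    // 15
        Console.WriteLine(Encoding.Unicode.GetByteCount(chinese)); // 10
    }
}
```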

  • So why does .NET use UTF-16 as the default encoding for string and UTF-8 for saving files?

Thank you.

P.S. I've already read the famous article

Solution

If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.

Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32!
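To see what .NET actually does today: char is a fixed-size 16-bit UTF-16 code unit, and anything outside the BMP takes two of them. A small sketch (the non-BMP character is an arbitrary example):

```csharp
using System;

class CharDemo
{
    static void Main()
    {
        // A .NET char is a single UTF-16 code unit: always 2 bytes.
        Console.WriteLine(sizeof(char));                                // 2

        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
        // so it needs a surrogate pair: two chars for one character.
        string clef = "\U0001D11E";
        Console.WriteLine(clef.Length);                                 // 2
        Console.WriteLine(char.IsSurrogatePair(clef[0], clef[1]));      // True
        Console.WriteLine(char.ConvertToUtf32(clef, 0).ToString("X"));  // 1D11E
    }
}
```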

Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.

The two disadvantages of UTF-16 are:

  • The number of code units per Unicode character is variable, because not all characters are in the BMP. Until emoji became popular, this didn't affect many apps in day-to-day use. These days, certainly for messaging apps and the like, developers using UTF-16 really need to know about surrogate pairs.
  • For plain ASCII (which a lot of text is, at least in the west) it takes twice the space of the equivalent UTF-8 encoded text.

(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)

Given the problems of surrogate pairs, I suspect if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
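To illustrate that last parenthetical, a sketch using the standard Normalize and StringInfo APIs (the accented example string is mine):

```csharp
using System;
using System.Globalization;
using System.Text;

class NthCharacter
{
    static void Main()
    {
        string precomposed = "café";        // é as one code point (U+00E9)
        string decomposed  = "cafe\u0301";  // e followed by COMBINING ACUTE ACCENT

        // Same visible text, different lengths in UTF-16 code units.
        Console.WriteLine(precomposed.Length);  // 4
        Console.WriteLine(decomposed.Length);   // 5

        // Normalizing to form C makes them compare equal again.
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC) == precomposed); // True

        // Counting user-perceived characters needs StringInfo, not Length.
        Console.WriteLine(new StringInfo(decomposed).LengthInTextElements); // 4
    }
}
```

The "same" text can thus have different Length values depending on its normalization form, which is part of why "the nth character" is a slippery notion.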
