What does "The .NET framework uses the UTF-16 encoding standard by default" mean?

Question

My study guide (for the 70-536 exam) says this twice in the text and encoding chapter, which is right after the I/O chapter.

All the examples so far are to do with simple file access using FileStream and StreamWriter.

It also says things like "If you don't know what encoding to use when you create a file, don't specify one and .NET will use UTF16" and "Specify different encodings using Stream constructor overloads".

Never mind the fact that the actual overloads are on the StreamWriter class but hey, whatever.

I am looking at StreamWriter right now in Reflector and I am certain I can see that the default is actually UTF8NoBOM.

But none of this is listed in the errata. It's an old book (I checked the errata of both editions), so if it was wrong I would have thought someone had picked up on it.....

Makes me think maybe I didn't understand it.

So.....any ideas what it is talking about? Some other place where there is a default?

This has me really confused.

Recommended answer

"UTF-16" is an annoying term, as it has two meanings which are easily confused.

The first meaning is a series of 16-bit code units. Most of these correspond directly to the Unicode character of the same number; characters outside the Basic Multilingual Plane (U+10000 upwards) are stored as two 16-bit code units, each of which is a surrogate.
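
To make the surrogate mechanism concrete, here is a minimal sketch. It uses Python rather than .NET purely for illustration, and the helper names `utf16_code_units` and `to_surrogates` are made up for this example. It lists the 16-bit units of a string and shows the arithmetic that splits a code point above U+FFFF into two surrogates.

```python
import struct

def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units that represent `text` in UTF-16."""
    raw = text.encode("utf-16-be")  # big-endian bytes, no BOM
    return list(struct.unpack(f">{len(raw) // 2}H", raw))

def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low

# 'A' is inside the BMP: one unit, equal to its code point.
print([hex(u) for u in utf16_code_units("A")])           # ['0x41']
# U+1D11E (musical G clef) is outside the BMP: two surrogate units.
print([hex(u) for u in utf16_code_units("\U0001D11E")])  # ['0xd834', '0xdd1e']
```

A .NET string of that single G clef character would likewise report a length of 2, because the length counts 16-bit units rather than characters.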

Many languages use UTF-16 in this sense for internal storage purposes, including as a native string type. This is the usual source of phrases like ".NET (or Java) uses UTF-16 as its default encoding". .NET accesses the elements of such a UTF-16 string 16 bits at a time (i.e., at the implementation level, as a uint16).

The next thing to consider is the encoding of such a UTF-16 string into linear bytes, for storage in a file or network stream. As always when you store larger numbers into bytes, there are two possible encodings: little-endian or big-endian. So you can use "UTF-16LE", the little-endian encoding of UTF-16 into bytes, or "UTF-16BE", the big-endian encoding.
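
As a small illustration of the two byte orders (again in Python, chosen only for brevity; the codec names "utf-16-le" and "utf-16-be" are Python's spellings), the same two 16-bit units come out with their bytes swapped:

```python
text = "A\u20ac"  # 'A' (U+0041) followed by the euro sign (U+20AC)

little = text.encode("utf-16-le")  # low byte of each unit first
big = text.encode("utf-16-be")     # high byte of each unit first

print(little.hex(" "))  # 41 00 ac 20
print(big.hex(" "))     # 00 41 20 ac
```

Neither byte sequence says which order it is in, so a reader has to know the order in advance or find it out some other way.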

("UTF-16LE" is the more commonly used of the two. Just to add to the confusion, Windows gives it the deeply misleading and ambiguous encoding name "Unicode". In reality it is almost always better to use UTF-8 for file storage and network streams than either UTF-16LE or UTF-16BE.)

But if you don't know whether a bunch of bytes contains "UTF-16LE" or "UTF-16BE", you can use the trick of looking at the first code point to work it out. This code point, the Byte Order Mark (BOM), is only valid when read one way around, so you can't mistake one encoding for the other.

This approach, of not caring what byte order you have but using a BOM to signal it, is usually referred to under the encoding name... "UTF-16".
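
The BOM trick can be sketched roughly as follows. This is an illustrative Python helper, not anything from .NET, and the name `decode_utf16` is hypothetical. It checks the first two bytes and picks the byte order accordingly.

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM to choose the byte order."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be")
    raise ValueError("no BOM found: byte order is unknown")

# The same text stored both ways, each prefixed with its BOM.
le = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
be = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

print(decode_utf16(le), decode_utf16(be))  # hi hi
```

Python's own "utf-16" codec (no LE/BE suffix) behaves the same way when decoding, which is exactly the byte-level sense of "UTF-16" described here.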

So, when someone says "UTF-16", you can't tell whether they mean a sequence of 16-bit code units, or a sequence of bytes in unspecified order that will decode to one.

("UTF-32" has the same problem.)

"If you don't know what encoding to use when you create a file, don't specify one and .NET will use UTF16"

If that's the actual direct quote, it is a lie. Constructing a StreamWriter without an encoding argument is explicitly specified to give you UTF-8 (without a BOM, which matches the UTF8NoBOM default you saw in Reflector).
