Byte size of characters when encoding


Problem Description






Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why is that?

Look:

/*
Each Unicode character in a string is defined by a Unicode scalar value,
also called ...

An index is the position of a Char, not a Unicode character, in a String. An
index is a zero-based, nonnegative number starting from the first position
in the string, which is index position zero. Consecutive index values might
not correspond to consecutive Unicode characters because a Unicode character
might be encoded as more than one Char. To work with each Unicode character
instead of each Char, use the System.Globalization.StringInfo class.
*/

With UTF-8 encoding, shouldn't one instance of struct Char occupy only
1/2, 1, 1 1/2, or 2 bytes?
Therefore UTF8Encoding.GetMaxByteCount(charCount) must return charCount * 2,
because charCount means the count of struct Char instances.
Or not? Maybe it means the count of Unicode characters?
If so, then UnicodeEncoding.GetMaxByteCount(charCount) must return
charCount * 4.

These methods do not fit each other.
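
As background for the MSDN passage quoted above, here is a minimal
illustrative sketch of how the Char count and the Unicode-character count of
the same string can differ, using the System.Globalization.StringInfo class
the passage mentions:

using System;
using System.Globalization;

class StringInfoDemo
{
    static void Main()
    {
        // "A" followed by U+1F600, a code point that needs a surrogate pair in UTF-16.
        string s = "A" + char.ConvertFromUtf32(0x1F600);

        Console.WriteLine(s.Length);                               // 3 Char values
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 2 Unicode characters
    }
}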

Solution

Vladimir wrote:

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why is that?

Strings in .NET are already Unicode (UTF-16) encoded. So if you encode the
string to an array of bytes with UnicodeEncoding, you get 2 bytes per character.

However, with UTF-8 encoding a single Unicode character can be encoded using
up to 4 bytes in the worst case. charCount * 4 is just a worst-case scenario,
for a string that happens to contain only characters requiring a 4-byte
encoding.
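
To make the worst-case idea concrete, here is a minimal illustrative sketch
(the exact GetMaxByteCount formula varies between Framework versions, so only
the relative sizes matter) comparing the exact encoded size of a string with
the upper bound reported for its Char count:

using System;
using System.Text;

class MaxByteCountDemo
{
    static void Main()
    {
        string ascii = "Hello";             // each Char encodes to 1 UTF-8 byte
        string cjk = "\u65E5\u672C\u8A9E";  // each of these Chars encodes to 3 UTF-8 bytes

        // Exact sizes for these particular strings
        Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));  // 5
        Console.WriteLine(Encoding.UTF8.GetByteCount(cjk));    // 9

        // Worst-case upper bounds, computed from the Char count alone
        Console.WriteLine(Encoding.UTF8.GetMaxByteCount(ascii.Length));
        Console.WriteLine(Encoding.Unicode.GetMaxByteCount(cjk.Length));
    }
}

GetMaxByteCount exists so you can size a buffer before calling GetBytes, so it
has to cover the worst possible input for that many Chars; GetByteCount gives
the exact size for one particular string.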



--
mikeb




Vladimir wrote:

Do you want to say that two instances of struct Char in UTF-8 can occupy
8 bytes?



It turns out that while a UTF-8 encoded character can take up to 4 bytes,
for the Framework a struct Char can always be encoded in at most 3 bytes.
That's because a struct Char holds a 16-bit Unicode value, and a 16-bit
value can always be encoded in 3 or fewer UTF-8 bytes.

A 4-byte UTF-8 encoding is only needed for Unicode code points that
require 'surrogates' - a pair of 16-bit values that together represent the
character. Surrogates cannot be represented in a single struct Char -
but I believe they are supported in strings.

Anyway, here's what can happen using struct Char:

// requires: using System.Text;
char c1 = '\uFFFF';
char c2 = '\u1000';

byte[] utf8bytes = Encoding.UTF8.GetBytes(new char[] { c1, c2 });

If you dump the byte array, you'll see that each Char was encoded into 3
UTF-8 bytes.
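
The 4-byte case only shows up for a code point above U+FFFF, and then it
always arrives as a surrogate pair, i.e. two Char values. A minimal
illustrative sketch:

using System;
using System.Text;

class SurrogateDemo
{
    static void Main()
    {
        // U+10348 lies outside the 16-bit range, so in a .NET string it is
        // stored as a surrogate pair of two Char values.
        string s = char.ConvertFromUtf32(0x10348);

        Console.WriteLine(s.Length);                          // 2 Char values
        Console.WriteLine(Encoding.UTF8.GetByteCount(s));     // 4 UTF-8 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(s));  // 4 UTF-16 bytes (2 per Char)
    }
}

So a 4-byte UTF-8 sequence never comes from a single Char; it comes from two
Chars together, which is consistent with the 3-bytes-per-Char maximum above.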

Jon Skeet has written an excellent article on this type of issue:

http://www.yoda.arachsys.com/csharp/unicode.html

--
mikeb

