Byte size of characters when encoding


Problem Description






Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why is that?

Look:

/*
Each Unicode character in a string is defined by a Unicode scalar value,
also called ...

An index is the position of a Char, not a Unicode character, in a String. An
index is a zero-based, nonnegative number starting from the first position
in the string, which is index position zero. Consecutive index values might
not correspond to consecutive Unicode characters because a Unicode character
might be encoded as more than one Char. To work with each Unicode character
instead of each Char, use the System.Globalization.StringInfo class.
*/

With UTF-8 encoding, shouldn't one instance of struct Char occupy only
1/2, 1, 1 1/2, or 2 bytes?
Therefore UTF8Encoding.GetMaxByteCount(charCount) must return charCount * 2,
because charCount means the count of struct Char instances.
Or not? Maybe it means the count of Unicode characters?
If so, then UnicodeEncoding.GetMaxByteCount(charCount) must return
charCount * 4.

These methods do not fit each other.
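
As background for the MSDN passage quoted above, here is a minimal
illustrative sketch of how the Char count and the Unicode-character count of
the same string can differ, using the System.Globalization.StringInfo class
the passage mentions:

using System;
using System.Globalization;

class StringInfoDemo
{
    static void Main()
    {
        // "A" followed by U+1F600, a code point that needs a surrogate pair in UTF-16.
        string s = "A" + char.ConvertFromUtf32(0x1F600);

        Console.WriteLine(s.Length);                               // 3 Char values
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 2 Unicode characters
    }
}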

Solution

Vladimir wrote:

Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

But why is that?

Strings in .NET are already Unicode (UTF-16) encoded. So if you encode the
string to an array of bytes with UnicodeEncoding, you get 2 bytes per character.

However, with UTF-8 encoding a single Unicode character can be encoded using
up to 4 bytes in the worst case. charCount * 4 is just a worst-case scenario,
for a string that happens to contain only characters requiring a 4-byte
encoding.
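
To make the worst-case idea concrete, here is a minimal illustrative sketch
(the exact GetMaxByteCount formula varies between Framework versions, so only
the relative sizes matter) comparing the exact encoded size of a string with
the upper bound reported for its Char count:

using System;
using System.Text;

class MaxByteCountDemo
{
    static void Main()
    {
        string ascii = "Hello";             // each Char encodes to 1 UTF-8 byte
        string cjk = "\u65E5\u672C\u8A9E";  // each of these Chars encodes to 3 UTF-8 bytes

        // Exact sizes for these particular strings
        Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));  // 5
        Console.WriteLine(Encoding.UTF8.GetByteCount(cjk));    // 9

        // Worst-case upper bounds, computed from the Char count alone
        Console.WriteLine(Encoding.UTF8.GetMaxByteCount(ascii.Length));
        Console.WriteLine(Encoding.Unicode.GetMaxByteCount(cjk.Length));
    }
}

GetMaxByteCount exists so you can size a buffer before calling GetBytes, so it
has to cover the worst possible input for that many Chars; GetByteCount gives
the exact size for one particular string.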



--
mikeb




Vladimir wrote:

Do you want to say that two instances of struct Char in UTF-8 can occupy
8 bytes?



It turns out that while a UTF-8 encoded character can take up to 4 bytes,
for the Framework a struct Char can always be encoded in at most 3 bytes.
That's because a struct Char holds a 16-bit Unicode value, and a 16-bit
value can always be encoded in 3 or fewer UTF-8 bytes.

A 4-byte UTF-8 encoding is only needed for Unicode code points that
require 'surrogates' - a pair of 16-bit values that together represent the
character. Surrogates cannot be represented in a single struct Char -
but I believe they are supported in strings.

Anyway, here's what can happen using struct Char:

// requires: using System.Text;
char c1 = '\uFFFF';
char c2 = '\u1000';

byte[] utf8bytes = Encoding.UTF8.GetBytes(new char[] { c1, c2 });

If you dump the byte array, you'll see that each Char was encoded into 3
UTF-8 bytes.
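
The 4-byte case only shows up for a code point above U+FFFF, and then it
always arrives as a surrogate pair, i.e. two Char values. A minimal
illustrative sketch:

using System;
using System.Text;

class SurrogateDemo
{
    static void Main()
    {
        // U+10348 lies outside the 16-bit range, so in a .NET string it is
        // stored as a surrogate pair of two Char values.
        string s = char.ConvertFromUtf32(0x10348);

        Console.WriteLine(s.Length);                          // 2 Char values
        Console.WriteLine(Encoding.UTF8.GetByteCount(s));     // 4 UTF-8 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(s));  // 4 UTF-16 bytes (2 per Char)
    }
}

So a 4-byte UTF-8 sequence never comes from a single Char; it comes from two
Chars together, which is consistent with the 3-bytes-per-Char maximum above.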

Jon Skeet has written an excellent article on this type of issue:

http://www.yoda.arachsys.com/csharp/unicode.html

--
mikeb

