UTF8 / UTF16 / Unicode 3.2 / RFC 3491 - Internationalization of Strings (Framework oversight?)


Problem description

I'm implementing RFC 3491 in .NET, and running into a strange issue.

Step 1 of RFC 3491 is performing a set of mappings dictated by tables B.1 and
B.2.

I'm having trouble with the following mappings though, and it seems like a
shortcoming of the .NET framework:

When I see Unicode value 0x10400, I'm supposed to map it to value 0x10428.
This list goes on (the left column is the existing value, the right column
is the replacement value):
(values are in HEX)

10400; 10428; Case map
10401; 10429; Case map
10402; 1042A; Case map
10403; 1042B; Case map
10404; 1042C; Case map
10405; 1042D; Case map
10406; 1042E; Case map
10407; 1042F; Case map
10408; 10430; Case map

(... and on for another few thousand lines...)

I've got the strings loaded into a StringBuilder, and am iterating through
it one character at a time, and comparing the character value to the mapping
values. The problem is that a Character cannot have a value greater than
0xFFFF. Both UTF8 and UTF16 encodings of Unicode 3.2 allow for values
larger than 0xFFFF.

Is there a workaround to this approach that I can use, or do I have to
convert everything to Bytes and do this the hard way?

--
Chris Mullins

Solution

Have you thought about using an array of "long" values? All the string
libraries in .NET assume Unicode, which is 2 bytes per character.

Alternately, you might use a "struct" containing a "long" in place of a
"long." That would just make it easier to group your character conversion
routines.
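
One way to picture the "array of wider integers" idea is to decode the 16-bit code units into whole code points, apply the mapping table to those, and re-encode afterwards. The following is only a minimal sketch in Python (chosen because the arithmetic is language-neutral, not because the thread uses it). `CASE_MAP` is a hypothetical excerpt holding just the nine rows quoted in the question, and the function names are made up for illustration; unpaired surrogates are simply passed through, which a real Nameprep implementation would probably need to reject.

```python
# Hypothetical excerpt of the mapping table: only the nine rows quoted in
# the question (10400 -> 10428 ... 10408 -> 10430), not the full RFC table.
CASE_MAP = {0x10400 + i: 0x10428 + i for i in range(9)}


def decode_utf16(units):
    """Turn a sequence of 16-bit code units into a list of code points."""
    points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine a high/low surrogate pair into one supplementary code point.
            low = units[i + 1]
            points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character, or an unpaired surrogate passed through unchanged.
            points.append(unit)
            i += 1
    return points


def encode_utf16(points):
    """Turn a list of code points back into 16-bit code units."""
    units = []
    for cp in points:
        if cp >= 0x10000:
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # high surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low surrogate
        else:
            units.append(cp)
    return units


def map_units(units):
    """Apply CASE_MAP per code point rather than per 16-bit unit."""
    return encode_utf16([CASE_MAP.get(cp, cp) for cp in decode_utf16(units)])
```

With this sketch, `map_units([0x41, 0xD801, 0xDC00])` leaves the `A` alone and rewrites the surrogate pair for U+10400 into the pair for U+10428. In .NET 2.0 and later, `char.ConvertToUtf32` and `char.ConvertFromUtf32` cover the same conversions; whether they are available depends on the framework version in use.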




Unfortunately, 2 bytes per character - which is what much of the libraries in
.NET assume - is not sufficient. The .NET "char" value is only good for the
asymmetric range -32768 to +65535 (this is sufficient for almost
everything... except for surrogate pairs). Because everything is based off
"Chars", I can't figure out how to get an arbitrary Unicode Code Point to
properly encode into any of the encodings. The problem is one of Unicode
surrogate pairs, which are supported, but I can't figure out how to properly
encode one...

If only there were a UTF.Encoder method that encoded a true Unicode Code
Point (any value from 0 to 10FFFF), rather than a char() array. There's
got to be a simple way around this, but it's not evident to me...

I suppose I could manually encode my value into a series of UTF8 bytes, but
that sure seems ugly.

--
Chris
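
For reference, the manual UTF-8 encoding mentioned above is a handful of shifts and masks. This is a rough Python sketch of the standard UTF-8 bit layout, not anything taken from the .NET libraries; the function name `encode_utf8` is made up for illustration, and rejecting surrogate values is an assumption about what the caller wants.

```python
def encode_utf8(cp):
    """Encode one Unicode code point (0 to 0x10FFFF) as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Assumption: surrogate values are not valid scalar values, so reject them.
        raise ValueError("not an encodable code point: %r" % (cp,))
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For the replacement value from the question, `encode_utf8(0x10428)` yields the four bytes F0 90 90 A8, which matches what Python's built-in encoder produces for the same code point.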

"Jason Smith" <ja***@nospam.com> wrote in message
news:OU**************@TK2MSFTNGP10.phx.gbl...

> Have you thought about using an array of "long" values? All the string
> libraries in .NET assume Unicode, which is 2 bytes per character.
>
> Alternately, you might use a "struct" containing a "long" in place of a
> "long." That would just make it easier to group your character conversion
> routines.





Chris Mullins <cm******@yahoo.com> wrote:

> Unfortunately, 2 bytes per character - which is what much of the libraries in
> .NET assume - is not sufficient. The .NET "char" value is only good for the
> asymmetric range -32768 to +65535 (this is sufficient for almost
> everything... except for surrogate pairs).

Char is actually 0-65535. The range -32768 to 65535 couldn't be stored
in 16 bits.

> Because everything is based off
> "Chars", I can't figure out how to get an arbitrary Unicode Code Point to
> properly encode into any of the encodings. The problem is one of Unicode
> surrogate pairs, which are supported, but I can't figure out how to properly
> encode one...

See my recent post - and
http://uk.geocities.com/BabelStone13...urrogates.html
(amongst other pages - a google search for
Unicode "surrogate pairs"
finds a lot of pages.)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

