UTF8 / UTF16 / Unicode 3.2 / RFC 3491 - Internationalization of Strings (Framework oversight?)
Problem description
I'm implementing RFC 3491 in .NET, and running into a strange issue.
Step 1 of RFC 3491 is performing a set of mappings dictated by tables B.1 and
B.2.
I'm having trouble with the following mappings though, and it seems like a
shortcoming of the .NET framework:
When I see Unicode value 0x10400, I'm supposed to map it to value 0x10428.
This list goes on (the left column is the existing value, the right column
is the replacement value):
(values are in HEX)
10400; 10428; Case map
10401; 10429; Case map
10402; 1042A; Case map
10403; 1042B; Case map
10404; 1042C; Case map
10405; 1042D; Case map
10406; 1042E; Case map
10407; 1042F; Case map
10408; 10430; Case map
(... and on for another few thousand lines...)
I've got the strings loaded into a StringBuilder, and am iterating through
it one character at a time, and comparing the character value to the mapping
values. The problem is that a Character cannot have a value greater than
0xFFFF. Both UTF8 and UTF16 encodings of Unicode 3.2 allow for values
larger than 0xFFFF.
Is there a workaround to this approach that I can use, or do I have to
convert everything to Bytes and do this the hard way?
--
Chris Mullins
Solution
Have you thought about using an array of "long" values? All the string
libraries in .NET assume Unicode, which is 2 bytes per character.
Alternately, you might use a "struct" containing a "long" in place of a
"long." That would just make it easier to group your character conversion
routines.
Unfortunately, 2 bytes per character - which is what much of the libraries in
.NET assume - is not sufficient. The .NET "char" value is only good for the
asymmetric range -32768 to +65535 (this is sufficient for almost
everything... except for surrogate pairs). Because everything is based off
"Chars", I can't figure out how to get an arbitrary Unicode Code Point to
properly encode into any of the encodings. The problem is one of Unicode
surrogate pairs, which are supported, but I can't figure out how to properly
encode one...
If only there were a UTF.Encoder method that encoded a true Unicode Code
Point (any value from 0 to 10FFFF), rather than a char() array. There's
got to be a simple way around this, but it's not evident to me...
I suppose I could manually encode my value into a series of UTF8 bytes, but
that sure seems ugly.
--
Chris
"Jason Smith" <ja***@nospam.com> wrote in message
news:OU**************@TK2MSFTNGP10.phx.gbl...
Have you thought about using an array of "long" values? All the string
libraries in .NET assume Unicode, which is 2 bytes per character.
Alternately, you might use a "struct" containing a "long" in place of a
"long." That would just make it easier to group your character conversion
routines.
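The manual encoding Chris is asking about does not have to go through UTF-8 bytes. In UTF-16, a code point above 0xFFFF is stored as two 16-bit code units (a surrogate pair), and the split is plain arithmetic. The following is a minimal C# sketch of that arithmetic, not a framework API: the class name `SurrogateSketch` and its two methods are hypothetical names chosen for illustration.

```csharp
using System;

public static class SurrogateSketch
{
    // Splits a Unicode code point (0..0x10FFFF, excluding the surrogate
    // range itself) into one or two UTF-16 code units.
    public static char[] ToUtf16(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
        {
            throw new ArgumentOutOfRangeException("codePoint");
        }

        // Values up to 0xFFFF fit in a single char.
        if (codePoint < 0x10000)
        {
            return new char[] { (char)codePoint };
        }

        // Subtracting 0x10000 leaves a 20-bit value: the top 10 bits go
        // into the high surrogate, the bottom 10 bits into the low one.
        int offset = codePoint - 0x10000;
        char high = (char)(0xD800 + (offset >> 10));
        char low = (char)(0xDC00 + (offset & 0x3FF));
        return new char[] { high, low };
    }

    // Recombines a high/low surrogate pair into the original code point.
    public static int FromUtf16(char high, char low)
    {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

With this sketch, `SurrogateSketch.ToUtf16(0x10400)` yields the pair 0xD801, 0xDC00, and passing that pair to `FromUtf16` gives back 0x10400. Later framework versions (.NET 2.0 onward) expose the same conversion as `char.ConvertFromUtf32` and `char.ConvertToUtf32`, so the hand-written version is mainly useful where those are unavailable.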
Chris Mullins <cm******@yahoo.com> wrote:
> Unfortunately, 2 bytes per character - which is what much of the libraries in
> .NET assume - is not sufficient. The .NET "char" value is only good for the
> asymmetric range -32768 to +65535 (this is sufficient for almost
> everything... except for surrogate pairs).
Char is actually 0-65535. The range -32768 to 65535 couldn't be stored
in 16 bits.
> Because everything is based off
> "Chars", I can't figure out how to get an arbitrary Unicode Code Point to
> properly encode into any of the encodings. The problem is one of Unicode
> surrogate pairs, which are supported, but I can't figure out how to properly
> encode one...
See my recent post - and
http://uk.geocities.com/BabelStone13...urrogates.html
(amongst other pages - a google search for
Unicode "surrogate pairs"
finds a lot of pages.)
--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
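Putting the thread together, the mapping step from the original question can be done by walking the string one code point at a time instead of one char at a time. The sketch below is one possible approach, not the method anyone in the thread used. It assumes .NET 2.0 or later (for `char.IsHighSurrogate`, `char.ConvertToUtf32` and `char.ConvertFromUtf32`). The class `CodePointMapper` and its `CaseMap` table are hypothetical names, and the table holds only the first three entries quoted in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CodePointMapper
{
    // Illustrative excerpt only: the first three rows quoted in the question.
    public static readonly Dictionary<int, int> CaseMap = new Dictionary<int, int>
    {
        { 0x10400, 0x10428 },
        { 0x10401, 0x10429 },
        { 0x10402, 0x1042A },
    };

    // Returns a copy of input with every code point found in table replaced.
    public static string Map(string input, IDictionary<int, int> table)
    {
        StringBuilder output = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            int codePoint;
            if (char.IsHighSurrogate(input[i]) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                // A valid surrogate pair: combine both chars into one code point.
                codePoint = char.ConvertToUtf32(input[i], input[i + 1]);
                i++; // skip the low surrogate already consumed
            }
            else
            {
                // An ordinary char (or an unpaired surrogate, kept as-is).
                codePoint = input[i];
            }

            int mapped;
            if (table.TryGetValue(codePoint, out mapped))
            {
                codePoint = mapped;
            }

            if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            {
                // ConvertFromUtf32 rejects surrogate values, so copy them directly.
                output.Append((char)codePoint);
            }
            else
            {
                output.Append(char.ConvertFromUtf32(codePoint));
            }
        }
        return output.ToString();
    }
}
```

Keying the table on `int` rather than `char` is what lets it hold values above 0xFFFF. The string itself stays an ordinary UTF-16 .NET string throughout, so no conversion to bytes is needed.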