UTF8 / UTF16 / Unicode 3.2 / RFC 3491 - Internationalization of Strings (Framework oversight?)

Posted: 2019/6/5 15:13:03


Problem description


I'm implementing RFC 3491 in .NET, and running into a strange issue.

Step 1 of RFC 3491 is performing a set of mappings dictated by tables B.1 and
B.2.

I'm having trouble with the following mappings though, and it seems like a
shortcoming of the .NET framework:

When I see Unicode value 0x10400, I'm supposed to map it to value 0x10428.
This list goes on (the left column is the existing value, the right column
is the replacement value):
(values are in HEX)

10400; 10428; Case map
10401; 10429; Case map
10402; 1042A; Case map
10403; 1042B; Case map
10404; 1042C; Case map
10405; 1042D; Case map
10406; 1042E; Case map
10407; 1042F; Case map
10408; 10430; Case map

(... and on for another few thousand lines...)

I've got the strings loaded into a StringBuilder, and am iterating through
it one character at a time, and comparing the character value to the mapping
values. The problem is that a Character cannot have a value greater than
0xFFFF. Both UTF8 and UTF16 encodings of Unicode 3.2 allow for values
larger than 0xFFFF.

Is there a workaround to this approach that I can use, or do I have to
convert everything to Bytes and do this the hard way?

--
Chris Mullins

Solution
Jason Smith <ja***@nospam.com> wrote:
Have you thought about using an array of "long" values? All the string
libraries in .NET assume Unicode, which is 2 bytes per character.

Alternatively, you might use a "struct" containing a "long" in place of a
"long". That would just make it easier to group your character conversion
routines.
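
One way to act on this suggestion, as a minimal sketch rather than a definitive implementation: decode the UTF-16 string into an array of 32-bit code points (an `int` is wide enough for anything up to 0x10FFFF, so `long` is not required), apply the mapping on that array, then re-encode. The class and method names (`CodePointMapper`, `ToCodePoints`, `FromCodePoints`, `Map`) are hypothetical. The sketch assumes .NET Framework 2.0 or later, where `char.IsSurrogatePair`, `char.ConvertToUtf32` and `char.ConvertFromUtf32` are available. It also assumes every table entry is one-to-one; the real tables contain entries that map to nothing or to several code points, which this does not handle.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CodePointMapper
{
    // Illustrative excerpt only: the first rows of the table quoted in the question.
    private static readonly Dictionary<int, int> CaseMap = new Dictionary<int, int>
    {
        { 0x10400, 0x10428 },
        { 0x10401, 0x10429 },
        { 0x10402, 0x1042A },
    };

    // Decode UTF-16 text into one int per Unicode code point.
    public static int[] ToCodePoints(string text)
    {
        List<int> result = new List<int>(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                result.Add(char.ConvertToUtf32(text, i));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP character (an unpaired surrogate is passed through as-is).
                result.Add(text[i]);
            }
        }
        return result.ToArray();
    }

    // Re-encode code points as UTF-16 text.
    public static string FromCodePoints(int[] codePoints)
    {
        StringBuilder sb = new StringBuilder(codePoints.Length);
        foreach (int cp in codePoints)
        {
            if (cp <= 0xFFFF)
                sb.Append((char)cp);                  // fits in a single char
            else
                sb.Append(char.ConvertFromUtf32(cp)); // becomes a surrogate pair
        }
        return sb.ToString();
    }

    // Apply the one-to-one mapping to every code point in the text.
    public static string Map(string text)
    {
        int[] cps = ToCodePoints(text);
        for (int i = 0; i < cps.Length; i++)
        {
            int mapped;
            if (CaseMap.TryGetValue(cps[i], out mapped))
                cps[i] = mapped;
        }
        return FromCodePoints(cps);
    }
}
```

With this in place, `CodePointMapper.Map("a\U00010400b")` returns `"a\U00010428b"`: the middle character occupies two `char` values in the string but is handled as a single entry in the `int` array.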


Unfortunately, 2 bytes per character - which is what many of the libraries in
.NET assume - is not sufficient. The .NET "char" value is only good for the
asymmetric range -32768 to +65535 (this is sufficient for almost
everything... except for surrogate pairs). Because everything is based on
"Chars", I can't figure out how to get an arbitrary Unicode Code Point to
properly encode into any of the encodings. The problem is one of Unicode
surrogate pairs, which are supported, but I can't figure out how to properly
encode one...

If only there were a UTF.Encoder method that encoded a true Unicode Code
Point (any value from 0 to 10FFFF), rather than a char() array. There's
got to be a simple way around this, but it's not evident to me...

I suppose I could manually encode my value into a series of UTF8 bytes, but
that sure seems ugly.

--
Chris
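
For what it's worth, later framework versions do offer something close to the method wished for above. The sketch below assumes .NET Framework 2.0 or later, where `char.ConvertFromUtf32` turns a code point into its UTF-16 string form, which any `Encoding` can then process. The wrapper class `CodePointEncoding` and its method names are hypothetical, made up for illustration.

```csharp
using System;
using System.Text;

public static class CodePointEncoding
{
    // UTF-16 form of a code point: one char for values up to 0xFFFF,
    // a surrogate pair (two chars) for values above that.
    public static string ToUtf16(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint);
    }

    // UTF-8 bytes of a code point, obtained by going through its UTF-16 form.
    public static byte[] ToUtf8(int codePoint)
    {
        return Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codePoint));
    }
}
```

For the code point 0x10428 from the table, `ToUtf16` gives the two chars 0xD801 0xDC28 and `ToUtf8` gives the four bytes F0 90 90 A8. Note that `char.ConvertFromUtf32` throws an `ArgumentOutOfRangeException` for values in the surrogate range 0xD800 to 0xDFFF or above 0x10FFFF, since those are not valid code points to encode.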


Chris Mullins <cm******@yahoo.com> wrote:
> Unfortunately, 2 bytes per character - which is what many of the libraries in
> .NET assume - is not sufficient. The .NET "char" value is only good for the
> asymmetric range -32768 to +65535 (this is sufficient for almost
> everything... except for surrogate pairs).

Char is actually 0-65535. The range -32768 to 65535 couldn't be stored
in 16 bits.

> Because everything is based on
> "Chars", I can't figure out how to get an arbitrary Unicode Code Point to
> properly encode into any of the encodings. The problem is one of Unicode
> surrogate pairs, which are supported, but I can't figure out how to properly
> encode one...

See my recent post - and
http://uk.geocities.com/BabelStone13...urrogates.html
(amongst other pages - a google search for
Unicode "surrogate pairs"
finds a lot of pages.)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
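
For readers who want to see what encoding a surrogate pair involves, here is a minimal sketch of the standard UTF-16 arithmetic, done by hand so that it does not depend on any particular framework version. The class name `Surrogates` and its methods are hypothetical. A code point above 0xFFFF has 0x10000 subtracted from it, leaving a 20-bit value; the top 10 bits are added to 0xD800 to form the high surrogate and the bottom 10 bits are added to 0xDC00 to form the low surrogate.

```csharp
using System;

public static class Surrogates
{
    // Split a supplementary code point (0x10000..0x10FFFF) into a high/low surrogate pair.
    public static char[] Encode(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Combine a high/low surrogate pair back into a single code point.
    public static int Decode(char high, char low)
    {
        if (high < 0xD800 || high > 0xDBFF)
            throw new ArgumentException("Not a high surrogate.", "high");
        if (low < 0xDC00 || low > 0xDFFF)
            throw new ArgumentException("Not a low surrogate.", "low");

        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

Worked through for the first table entry: 0x10400 - 0x10000 = 0x400, whose top 10 bits are 1 and bottom 10 bits are 0, so the pair is 0xD801 0xDC00. Its mapped value 0x10428 becomes 0xD801 0xDC28, which means this particular mapping only changes the low surrogate.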
