UTF-16 & wchar_t: the 2nd worst thing about C++


Question

This is one of the first obstacles I encountered when getting started with
C++. I found that everybody had their own idea of what a string is. There
was std::string, QString, xercesc::XMLString, etc. There are also char,
wchar_t, QChar, XMLCh, etc., for character representation. Coming from
Java where a String is a String is a String, that was quite a shock.

Well, I'm back to looking at this, and it still isn't pretty. I've found
what appears to be a way to go between QString and XMLCh. XMLCh is
reported to be UTF-16. QString is documented to be the same. QString
provides very convenient functions for 'converting'[*] between NTBS const
char*, std::string and QString[**]. So, using QString as an intermediary,
I can construct a std::string from a const XMLCh* NTBS, and a const XMLCh*
NTBS from a std::string.
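
(For concreteness, the QString route just described might look roughly like the sketch below. It assumes Qt 5, where QString::toStdString()/fromStdString() convert via UTF-8, and it assumes XMLCh is a 16-bit type that can be reinterpreted as ushort; neither assumption is guaranteed by the C++ standard.)

#include <QString>
#include <string>
#include <xercesc/util/XercesDefs.hpp>   // defines XMLCh

// UTF-16 NTBS -> std::string (UTF-8 under Qt 5), via QString
std::string xmlchToStdString(const XMLCh* utf16)
{
    return QString::fromUtf16(reinterpret_cast<const ushort*>(utf16)).toStdString();
}

// std::string -> QString; QString::utf16() on the result exposes a UTF-16 NTBS
QString stdStringToQString(const std::string& s)
{
    return QString::fromStdString(s);
}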

My question is whether I can do this without the QString intermediary. That
is, given a UTF-16 NTBS, can I construct a std::string representing the
same characters? And given a std::string, can I convert it to a UTF-16
NTBS? I have been told that some wchar_t implementations are UTF-16, and
some are not.
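
(For reference: Xerces-C ships xercesc::XMLString::transcode(), but that converts to and from the local code page rather than UTF-8. On a compiler new enough to provide <codecvt>, a direct UTF-16/UTF-8 round trip without QString might be sketched as below, assuming the UTF-16 units can be treated as char16_t.)

#include <codecvt>
#include <locale>
#include <string>

// UTF-16 NTBS -> UTF-8 std::string
std::string utf16ToUtf8(const char16_t* utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}

// UTF-8 std::string -> UTF-16; c_str() on the result yields a UTF-16 NTBS
std::u16string utf8ToUtf16(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}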

My reading of the ISO/IEC 14882:2003 is that implementations must support
the UTF-16 character set[***], but are not required to use UTF-16 encoding.
The proper way to express a member of the UTF-16 character set is to use
the form \UNNNNNNNN[****], where NNNNNNNN is the universal-character-name,
or \UNNNN, where NNNN is the character short name of a
universal-character-name whose value is \U0000NNNN, unless the character is
a member of the basic character set, or if the hexadecimal value of the
character expressed is less than 0x20, or if the hexadecimal value of the
character expressed is in the range 0x7f-0x9f (inclusive). Members of the
UTF-16 character set which are also members of the basic character set are
to be expressed using their literal symbol in an L-prefixed character
literal, or an L-prefixed string literal.
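
(A concrete illustration of the forms just described, written with \u for the four-digit short form and \U for the eight-digit form as the standard's grammar spells them. How many bytes each wide literal occupies depends on the implementation's wchar_t, which is exactly the portability question raised here.)

// basic character set member: the literal symbol in an L-prefixed literal
const wchar_t  capital_a   = L'A';
// short form \uNNNN, equivalent to \U000000E9 (U+00E9, LATIN SMALL LETTER E WITH ACUTE)
const wchar_t  e_acute     = L'\u00E9';
// long form \UNNNNNNNN (U+03A3, GREEK CAPITAL LETTER SIGMA) in an L-prefixed string literal
const wchar_t* greek_sigma = L"\U000003A3";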

This tells me that the UTF-16 defined by the Xerces XMLCh does not conform
to the definition of the extended character set of a C++ implementation.
http://xml.apache.org/xerces-c/apiDo...pp-source.html

Is my understanding of this situation correct?

UTF-16 seems to be a good candidate for a lingua franca of runtime character
encodings. UTF-8 makes more sense in the context of data transmission and
storage. As far as I can tell, implementations are not required by the C++
Standard to provide a UTF-16 string class. Is that correct?
[*] Here 'converting' means either type conversion or constructing a new
variable to hold a different encoding of the same character sequence.

[**] Leaving aside unanswered questions such as whether knowing that both are
encoded as UTF-16 is sufficient to assume the representations are identical.

[***] Here UTF-16 is used as a synonym for UCS-2 described in ISO/IEC 10646
"Universal Multiple-Octet Coded Character Set", though there may be subtle
differences.
[****] The case of the 'U' in \UNNNNNNNN or \UNNNN is irrelevant.

So what is the worst thing about C++? '#'
--
NOUN:1. Money or property bequeathed to another by will. 2. Something handed
down from an ancestor or a predecessor or from the past: a legacy of
religious freedom. ETYMOLOGY: MidE legacie, office of a deputy, from OF,
from ML legatia, from L legare, to depute, bequeath. www.bartleby.com/61/

Answers

Steven T. Hatton wrote:
UTF-16 seems to be a good candidate for a lingua franca of runtime
character
encodings. UTF-8 makes more sense in the context of data transmission and
storage. As far as I can tell, implementations are not required by the C++
Standard to provide a UTF-16 string class. Is that correct?




To answer my own question, that appears to be partially correct. The
implementation must provide support for the subset of UTF-16 required by
all of its supported locales.

I asked for some clarification as to why Xerces-C uses a different data type
than wchar_t to hold UTF-16. One person responded by arguing that the
standard does not specify that wchar_t is 16 bits. He pointed out that GCC
uses a 32 bit data type for wchar_t. In practice, does that matter? I
mean, in the real world, will any system use a different amount of physical
memory for a 32 bit data type than for a 16 bit type?
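
(A quick way to see what is at stake on a given implementation is simply to print the widths; the sketch below should report a 4-byte wchar_t under GCC on Linux and a 2-byte wchar_t under MSVC.)

#include <cstdio>

int main()
{
    // The standard fixes neither the size of wchar_t nor its encoding.
    std::printf("sizeof(wchar_t)  = %zu bytes\n", sizeof(wchar_t));
    std::printf("sizeof(L\"hello\") = %zu bytes\n", sizeof(L"hello"));  // 6 code units, incl. the terminator
    return 0;
}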




Steven T. Hatton wrote:
Steven T. Hatton wrote:
UTF-16 seems to be a good candidate for a lingua franca of runtime
character
encodings. UTF-8 makes more sense in the context of data transmission and
storage. As far as I can tell, implementations are not required by the C++
Standard to provide a UTF-16 string class. Is that correct?






To answer my own question, that appears to be partially correct. The
implementation must provide support for the subset of UTF-16 required by
all of its supported locales.

I asked for some clarification as to why Xerces-C uses a different data type
than wchar_t to hold UTF-16. One person responded by arguing that the
standard does not specify that wchar_t is 16 bits. He pointed out that GCC
uses a 32 bit data type for wchar_t. In practice, does that matter? I
mean, in the real world, will any system use a different amount of physical
memory for a 32 bit data type than for a 16 bit type?






Yes. Clearly every "real world" system will use twice as much memory
storing the 32-bit character strings as it would storing equivalent
strings with 16-bit characters. Just as one would expect that
operations on the 4-byte character strings will require twice the
number of cycles as equivalent operations on the 2-byte strings. What
would be the basis for thinking otherwise?

So essentially you end up with strings that are twice the size and that
are twice as slow as 16-bit strings containing identical content.

Greg


Greg wrote:
Yes. Clearly every "real world" system will use twice as much memory
storing the 32-bit character strings as it would storing equivalent
strings with 16-bit characters. Just as one would expect that
operations on the 4-byte character strings will require twice the
number of cycles as equivalent operations on the 2-byte strings. What
would be the basis for thinking otherwise?
On a 32 bit system the unit of data processed by each instruction is 32
bits. That means that storing two 16 bit values in one 32 bit word would
require some kind of packing and unpacking. Perhaps I am wrong, but my
understanding is that such processor overhead is typically not expended
when dealing with in-memory data.
So essentially you end up with strings that are twice the size and that
are twice as slow as 16-bit strings containing identical content.




Do you have any benchmarked examples to demonstrate that?


