How should I handle the multibyte char set string in C++?


Problem description

Hi, everyone,

I'm writing a program that uses wstring (wchar_t) as its internal string type.

The problem arises when I convert multibyte character set strings
in different encodings to wstring (which is Unicode: UCS-2LE (BMP) on
Win32, and UCS-4 on Linux?).

I have 2 ways to do the job:

1) Use std::locale: set std::locale::global() and use mbstowcs() and
wcstombs() to do the conversion.

2) Use platform-dependent functions to do the job, such as libiconv on
Linux, or MultiByteToWideChar() and WideCharToMultiByte() on Win32.

At first glance, solution 1) looks like the obvious choice. It is
the more C++-flavored way, and, in detail, the codecvt facet actually
wraps the platform functions, calling libiconv on Linux and
MultiByteToWideChar() or WideCharToMultiByte() on Win32 (depending on
the STL implementation) to do the real job (if my understanding is
correct).

However, I have 2 problems.

First, I have to set the global locale before I do the conversion.

This has 2 side effects. The first is that in a multithreaded
program, changing the global setting affects other threads that
use a different encoding for their conversions. Yes, I can lock
around the conversion, but that makes no sense and causes really low
performance.

The second is that every call to std::locale::global() is time
consuming: creating a locale object and setting it as the global
locale is not a light job, and it does cause low performance.

The second problem is that the system-dependent conversion functions
seem to support many more encodings than std::locale() does in each STL
implementation. For example, libiconv supports the UCS-2LE encoding, but
g++'s locale() doesn't support it. MultiByteToWideChar() supports UTF-8
conversion, but the std::locale() in MSVC (8.0)'s STL doesn't support
".65001" for code page 65001, which is UTF-8.

The locale strings not being the same on different platforms might be a
third problem, but I can easily work around it with #ifdef ... #endif.

So, back to the original question: how should I handle MBCS strings in
C++?

Thanks.

Solution

On Apr 29, 4:40 pm, Dancefire <Dancef...@gmail.com> wrote:

> I'm writing a program that uses wstring (wchar_t) as its internal string type.

> The problem arises when I convert multibyte character set strings
> in different encodings to wstring (which is Unicode: UCS-2LE (BMP) on
> Win32, and UCS-4 on Linux?).

> I have 2 ways to do the job:

> 1) Use std::locale: set std::locale::global() and use mbstowcs() and
> wcstombs() to do the conversion.

Why not std::codecvt? It is a facet which you can obtain from a
locale.

> 2) Use platform-dependent functions to do the job, such as libiconv on
> Linux, or MultiByteToWideChar() and WideCharToMultiByte() on Win32.

> At first glance, solution 1) looks like the obvious choice. It is
> the more C++-flavored way, and, in detail, the codecvt facet actually
> wraps the platform functions, calling libiconv on Linux and
> MultiByteToWideChar() or WideCharToMultiByte() on Win32 (depending on
> the STL implementation) to do the real job (if my understanding is
> correct).

> However, I have 2 problems.

> First, I have to set the global locale before I do the conversion.

Why? You can get a facet from any locale. That's the one
advantage C++ locales have over the C stuff.
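
The point about taking the facet from any locale can be sketched as follows. This is only an illustration, not code from the thread: the helper name widen is hypothetical, and the sizing rule (the output never needs more wchar_t elements than there are input bytes) is an assumption stated in the comments.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a multibyte string to std::wstring with the
// codecvt facet of the locale passed in. std::locale::global() is not touched.
std::wstring widen(const std::string& in, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (in.empty())
        return std::wstring();

    // Assumption: every wchar_t produced consumes at least one input byte,
    // so in.size() elements are enough for the output.
    std::wstring out(in.size(), L'\0');

    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;

    const std::codecvt_base::result r = cvt.in(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("widen: conversion failed");

    // A result of ok means the whole input range was consumed.
    assert(from_next == in.data() + in.size());

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Each caller, and each thread, can pass its own std::locale object, so no lock around a global setting is needed. A call such as widen(bytes, std::locale("ja_JP.eucJP")) would decode EUC-JP input only if a locale of that name is installed; locale names are platform-specific, and the std::locale constructor throws std::runtime_error for a name it does not know.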

[...]

> The second problem is that the system-dependent conversion functions
> seem to support many more encodings than std::locale() does in each STL
> implementation.

That's a problem with the C++ library implementation. A quality
implementation will support all of the code sets that are
installed on the system.

> For example, libiconv supports the UCS-2LE encoding, but g++'s
> locale() doesn't support it. MultiByteToWideChar() supports
> UTF-8 conversion, but the std::locale() in MSVC (8.0)'s STL doesn't
> support ".65001" for code page 65001, which is UTF-8.

Finding out which locales are available and work can be a bit of a
game :-). So can finding out how they are named, if you're not under Unix.

> The locale strings not being the same on different platforms might be a
> third problem, but I can easily work around it with #ifdef ... #endif.

> So, back to the original question: how should I handle MBCS strings in
> C++?

The official answer is std::codecvt. In practice, I roll my
own :-).

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'école, France, +33 (0)1 30 23 00 34


> Why not std::codecvt? It is a facet which you can obtain from a
> locale.

Oops, I missed std::codecvt. Thank you.

After trying std::codecvt, I have 2 more questions.

1) Should we initialize the mbstate_t variable? And how do we initialize
mbstate_t portably, in the C++ way?

Much of the sample code I saw on the net doesn't initialize the mbstate_t
variable. For example:

http://incubator.apache.org/stdcxx/d...cvt.html#sec12

std::mbstate_t state;

And the sample in the MSDN shipped with Visual Studio 2005:

mbstate_t state;

They just declare it and use it, and never assign an initial value to the
state. I did hit a problem in VC80 when I didn't initialize the state to
zero (the first character was always messed up in debug mode;
the following ones were OK).

But the online version of MSDN does initialize the mbstate_t variable:
http://msdn2.microsoft.com/en-us/lib...58(VS.80).aspx

mbstate_t state = {0};

I also found code that uses memset() to zero the whole object, but I
don't think that's the C++ way.
How should I do the initialization portably?

2) I can get the wchar_t* buffer length for codecvt.in() from
codecvt.length(), but how do I find out the char* buffer length for
codecvt.out()?

I can pass a null pointer to mbstowcs() or wcstombs() to get the length of
the output buffer I need, but I don't know how to do the same thing
using codecvt<>.
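
For reference, the null-pointer length query mentioned here might look like the sketch below. It uses std::mbsrtowcs rather than mbstowcs(), because the C standard defines the null-destination behavior for the restartable function, whereas for mbstowcs() it is a POSIX extension. The function name widen_with_c_locale is hypothetical. The conversion depends on the LC_CTYPE category of the current C locale, which is exactly the global state discussed in this thread.

```cpp
#include <cassert>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: two-pass conversion with the C library.
// Pass 1 asks how many wide characters are needed, pass 2 converts.
std::wstring widen_with_c_locale(const char* src)
{
    std::mbstate_t state = std::mbstate_t();
    const char* p = src;

    // Null destination: nothing is written, the length argument is ignored,
    // and the return value is the count without the terminating null.
    const std::size_t n = std::mbsrtowcs(nullptr, &p, 0, &state);
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    if (n == 0)
        return std::wstring();

    std::wstring out(n, L'\0');
    state = std::mbstate_t();  // start again from the initial state
    p = src;
    const std::size_t written = std::mbsrtowcs(&out[0], &p, n, &state);
    assert(written == n);
    (void)written;  // avoids an unused-variable warning when NDEBUG is set
    return out;
}
```

In the default "C" locale this handles plain ASCII input; other encodings require a prior call to std::setlocale, which affects the whole process.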

>> For example, libiconv supports the UCS-2LE encoding, but g++'s
>> locale() doesn't support it. MultiByteToWideChar() supports
>> UTF-8 conversion, but the std::locale() in MSVC (8.0)'s STL doesn't
>> support ".65001" for code page 65001, which is UTF-8.

> Finding out which locales are available and work can be a bit of a
> game :-). So can finding out how they are named, if you're not under Unix.

I use "locale -a" to list all the locale strings supported on Linux, and
the following link to find the locale strings on Windows:

http://msdn2.microsoft.com/en-us/lib...78(vs.80).aspx

However, I still cannot handle "UCS-2"/"UTF16" on Linux or
"UTF8"/"UTF16" on Windows with std::locale. Do you know how I can do
this?

> The official answer is std::codecvt. In practice, I roll my
> own :-).

Thanks again, you've been a real help.


On Apr 30, 4:56 am, Dancefire <Dancef...@gmail.com> wrote:
[...]

> 1) Should we initialize the mbstate_t variable? And how do we initialize
> mbstate_t portably, in the C++ way?

> Much of the sample code I saw on the net doesn't initialize the mbstate_t
> variable. For example:

> http://incubator.apache.org/stdcxx/d...cvt.html#sec12

> std::mbstate_t state;

Strictly speaking, you should zero-initialize the state. It doesn't
matter in the trivial example shown in the Apache stdcxx documentation, but
in general the state must either be zeroed out (i.e., represent the
initial shift state) or be the result of a prior conversion.

I have corrected the example program to initialize the state variable,
see: http://svn.apache.org/viewvc?view=rev&revision=533806. I'll fix
the docs next.

[...]

> mbstate_t state = {0};

> I also found code that uses memset() to zero the whole object, but I
> don't think that's the C++ way.
> How should I do the initialization portably?

Like so:

mbstate_t state = mbstate_t ();
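
As a small check of that idiom, the sketch below value-initializes a state object and confirms with std::mbsinit that it describes the initial conversion state. The function name initial_state is hypothetical and only serves the illustration.

```cpp
#include <cassert>
#include <cwchar>

// Hypothetical helper: returns an mbstate_t in the initial conversion state.
std::mbstate_t initial_state()
{
    // Value-initialization zeroes the object whatever its internal layout is;
    // a zero-valued mbstate_t describes the initial conversion state.
    std::mbstate_t state = std::mbstate_t();

    // std::mbsinit returns nonzero for an initial conversion state.
    assert(std::mbsinit(&state) != 0);
    return state;
}
```

Unlike `= {0}`, this form does not depend on what the first member of the implementation's mbstate_t happens to be, and unlike memset() it needs no size argument.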


> 2) I can get the wchar_t* buffer length for codecvt.in() from
> codecvt.length(), but how do I find out the char* buffer length for
> codecvt.out()?

codecvt::length() returns the number of extern_type characters (i.e.,
narrow chars for codecvt<wchar_t, char>).
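
The reply does not spell out how to size the buffer for out(). One possible approach, offered here as an assumption rather than as the poster's recommendation, is an upper bound based on codecvt::max_length(), the largest number of narrow chars a single wchar_t can expand to. The helper name narrow is hypothetical, and the sketch does not call unshift(), so it assumes an encoding that needs no closing shift sequence.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a std::wstring to a multibyte string with the
// codecvt facet of the given locale, sizing the buffer from max_length().
std::string narrow(const std::wstring& in, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (in.empty())
        return std::string();

    // Upper bound: each wchar_t yields at most max_length() narrow chars.
    const std::size_t per_char = static_cast<std::size_t>(cvt.max_length());
    std::string out(in.size() * per_char, '\0');

    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    const wchar_t* from_next = nullptr;
    char* to_next = nullptr;

    const std::codecvt_base::result r = cvt.out(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("narrow: conversion failed");

    // A result of ok means the whole input range was consumed.
    assert(from_next == in.data() + in.size());

    out.resize(static_cast<std::size_t>(to_next - &out[0]));  // drop unused tail
    return out;
}
```

The buffer is usually larger than needed and is trimmed afterwards. That trades some memory for a single pass, instead of the count-then-convert pattern used with wcstombs().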


[...]

> However, I still cannot handle "UCS-2"/"UTF16" on Linux or
> "UTF8"/"UTF16" on Windows with std::locale. Do you know how I can do
> this?

In the Apache C++ Standard Library you can do it using
a codecvt_byname facet constructed with the name "UTF-8@UCS"
as an argument, although it's not mentioned on the documentation page:
http://incubator.apache.org/stdcxx/d...vt-byname.html
Let me look into adding it.

