Converting UTF-8 text to wchar_t


Question

I know this question has been asked here quite a few times, and I have read some of the answers, but several different solutions are suggested and I'm trying to figure out which is best.

I'm writing a C99 application that receives XML text encoded in UTF-8.

Part of its job is to copy and manipulate that string (finding a substring, concatenating, etc.).

Since I would rather not use an external, non-standard library right now, I'm trying to implement this with wchar_t.

Currently I'm using mbstowcs to convert the text to wchar_t for easy manipulation, and for the inputs I tried in several languages it worked fine.

The thing is, I have read that some people had issues with UTF-8 and mbstowcs, so I would like to know whether this use is permitted and acceptable.

Another option I came across is using iconv with the WCHAR_T parameter. However, I'm working on a platform (not a PC) whose locale support is very limited: only the ANSI C locale is available. Would that work there?

I also came across a very popular C++ library, but I'm limited to a C99 implementation.

Also, I will be compiling this code on another platform where sizeof(wchar_t) is different (2 bytes, versus 4 bytes on my machine). How can I overcome that? By using fixed-size character containers? But then, which manipulation functions should I use instead?

Happy to hear some thoughts. Thanks.

Answer

C does not define the encoding of the char and wchar_t types, and the standard library only mandates some functions that translate between the two without saying how. mbstowcs interprets its input according to the current locale's multibyte encoding, so if the implementation-dependent encoding of char is not UTF-8, mbstowcs will result in data corruption.
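As a rough illustration of that locale dependence, here is a minimal sketch of a checked wrapper around mbstowcs. The helper name mb_to_wide is hypothetical, not something from the question. The wrapper only reports failure; it cannot make the conversion UTF-8 aware. On a platform that offers nothing beyond the "C" locale, as described in the question, only ASCII input is portably safe with this approach.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not from the question): converts a NUL-terminated
 * multibyte string to wide characters using the CURRENT locale's encoding
 * (LC_CTYPE). It does not assume UTF-8.
 * Returns the number of wide characters written, excluding the terminator,
 * or (size_t)-1 if the input is invalid in that encoding or does not fit. */
static size_t mb_to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    assert(src != NULL && dst != NULL && dst_len > 0);

    size_t n = mbstowcs(dst, src, dst_len);
    if (n == (size_t)-1)
        return (size_t)-1;   /* byte sequence not valid in this locale */
    if (n == dst_len)
        return (size_t)-1;   /* buffer filled, no room for the terminator */
    return n;
}
```

For UTF-8 input to be decoded correctly, the program would first have to select a UTF-8 locale with setlocale(LC_CTYPE, ...), and the available locale names are platform-specific.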

From the Rationale for the C99 standard:

However, the five functions are often too restrictive and too primitive to develop portable international programs that manage characters.

...

C90 deliberately chose not to invent a more complete multibyte- and wide-character library, choosing instead to await their natural development as the C community acquired more experience with wide characters.

Sourced from here.

So, if you have UTF-8 data in your chars, there is no standard API for converting it to wchar_t.
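One way to work around the missing standard API, and around the differing sizeof(wchar_t), is to decode UTF-8 by hand into fixed-width uint32_t code points. The following is only a sketch of that idea. The function names utf8_decode_one and utf8_to_utf32 are made up for illustration, and the choice to reject overlong forms and surrogates is an assumption about how strict the application wants to be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from the UTF-8 bytes at s
 * (at most len bytes available) into *out. Returns the number of bytes
 * consumed, or 0 if the sequence is truncated or malformed. */
static size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (len == 0)
        return 0;

    uint32_t cp;
    size_t need;                         /* continuation bytes expected */
    unsigned char b0 = s[0];

    if (b0 < 0x80)                { cp = b0;        need = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; need = 1; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; need = 2; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; need = 3; }
    else return 0;                       /* stray continuation or bad lead byte */

    if (len < need + 1)
        return 0;                        /* truncated sequence */

    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                    /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
    static const uint32_t min_cp[4] = { 0x0, 0x80, 0x800, 0x10000 };
    if (cp < min_cp[need] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return need + 1;
}

/* Hypothetical helper: decodes src_len bytes of UTF-8 into dst.
 * Returns the number of code points written, or (size_t)-1 if the input
 * is malformed or dst (capacity dst_cap) is too small. */
static size_t utf8_to_utf32(const char *src, size_t src_len,
                            uint32_t *dst, size_t dst_cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t in = 0, n = 0;

    while (in < src_len) {
        uint32_t cp;
        size_t used = utf8_decode_one(s + in, src_len - in, &cp);
        if (used == 0 || n == dst_cap)
            return (size_t)-1;
        dst[n++] = cp;
        in += used;
    }
    return n;
}
```

Because the result is a plain array of uint32_t, its layout is the same on every platform. The trade-off is that the wcs* functions no longer apply, so searching and concatenation have to be written against that array, or done directly on the UTF-8 bytes.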

In my opinion, wchar_t should usually be avoided unless necessary; you might need it if you are using Win32 APIs, for example. I am not convinced it will simplify string manipulation. On Windows wchar_t is always UTF-16LE, so you may still need more than one wchar_t to represent a single Unicode code point anyway.
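To make the "more than one wchar_t per code point" remark concrete, here is a small sketch of UTF-16 encoding with 16-bit units. The helper name utf16_encode_one is hypothetical, and uint16_t stands in for a 2-byte wchar_t. Any code point above U+FFFF comes out as a surrogate pair, so indexing a 16-bit wide string by element does not index it by character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes the code point cp as UTF-16 into out[0..1].
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is a
 * surrogate value or lies beyond U+10FFFF. */
static size_t utf16_encode_one(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;           /* fits in a single unit */
        return 1;
    }

    cp -= 0x10000;                       /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

For example, U+20AC fits in one unit, while U+1F600 needs the pair 0xD83D 0xDE00.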

I suggest you investigate the ICU project, at least from an educational standpoint.
