Character representation in languages such as Chinese in C++


Question



Hi,

I know that in a "normal" language the size of a char is 1 byte. What is the maximum size of a char in languages such as Chinese? Is wchar_t the corresponding representation?

Thanks in advance

Recommended answer

Take a look at ICU (International Components for Unicode).

The ICU libraries provide first-class support for Unicode, while wchar_t only provides support for wider character encodings (its size and encoding are implementation-defined).

You can encode Chinese characters in UTF-8, UTF-16, UTF-32, GB 18030, code page 936, and others.
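
As a rough illustration of how the same character takes a different number of bytes depending on the encoding, the sketch below measures two CJK characters in the three Unicode encoding forms using C++11 string literals. The constant names are made up for this example, and the legacy encodings (GB 18030, code page 936) are not covered because the language has no literal syntax for them.

```cpp
#include <cstddef>

// sizeof on a string literal counts the terminating null code unit,
// so one code unit is subtracted in each case.

// U+4E2D (中) lies inside the Basic Multilingual Plane.
constexpr std::size_t zhong_utf8_bytes  = sizeof(u8"\u4E2D") - 1;
constexpr std::size_t zhong_utf16_bytes = sizeof(u"\u4E2D") - sizeof(char16_t);
constexpr std::size_t zhong_utf32_bytes = sizeof(U"\u4E2D") - sizeof(char32_t);

// U+20000 (a CJK Extension B ideograph) lies outside the BMP.
constexpr std::size_t extb_utf8_bytes  = sizeof(u8"\U00020000") - 1;
constexpr std::size_t extb_utf16_bytes = sizeof(u"\U00020000") - sizeof(char16_t);
constexpr std::size_t extb_utf32_bytes = sizeof(U"\U00020000") - sizeof(char32_t);
```

Under these assumptions the first character needs 3, 2 and 4 bytes in UTF-8, UTF-16 and UTF-32 respectively, and the second needs 4 bytes in all three.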

Best regards
Espen Harlinn


What you "know" is not true!

First, how many bytes a character takes is not purely a characteristic of the language. And of course, there is no such thing as a "normal" language. Core Unicode does not define how many bytes each character has; it defines the set of code points and a correspondence between characters as cultural phenomena, abstracted from their concrete glyphs, and a set of integer values understood in their mathematical sense, abstracted from their computer representation.

Encodings called UTFs define how to represent each code point in bytes. Only UTF-32 has a fixed 4 bytes per character. Byte-oriented UTF-8 uses an interesting algorithm which makes a character take 1, 2, 3 or 4 bytes, with the actual length determined by the value of the leading byte. UTF-16 is not a 16-bit code (!): the length of a character can be either 16 or 32 bits (code points outside the Basic Multilingual Plane (BMP) are expressed as a surrogate pair, that is, two 16-bit words). Also, UTF-16 and UTF-32 encodings can be little endian or big endian.
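
The two variable-length rules described above can be sketched in a few lines. This is only an illustration, not a validating decoder: the names utf8_sequence_length, Utf16Units and encode_utf16 are invented here, overlong or out-of-range UTF-8 lead bytes are not rejected, and encode_utf16 assumes its argument is a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
// Number of bytes in a UTF-8 sequence, judged only from the bit pattern
// of its leading byte. Returns 0 for a continuation byte (10xxxxxx) or
// any other byte whose pattern cannot start a sequence.
inline int utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Result of encoding one code point as UTF-16: one or two 16-bit units.
struct Utf16Units {
    char16_t unit[2];
    int count;
};

// Encode a code point as UTF-16 (assumes cp is a valid code point).
inline Utf16Units encode_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // Inside the BMP: a single 16-bit unit.
        return { { static_cast<char16_t>(cp), 0 }, 1 };
    }
    // Outside the BMP: split the 20-bit offset into a surrogate pair.
    cp -= 0x10000;
    return { { static_cast<char16_t>(0xD800 + (cp >> 10)),
               static_cast<char16_t>(0xDC00 + (cp & 0x3FF)) }, 2 };
}
```

For example, utf8_sequence_length(0xE4) gives 3 (0xE4 is the first UTF-8 byte of U+4E2D), and encode_utf16(0x20000) yields the pair 0xD840, 0xDC00.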

Now, about "normal" languages. Which language do you want to consider? American English, perhaps? All expressed in ASCII, code points 0 to 127, right? Think again! It depends on what you consider a "language". How about the full punctuation used in this language? Consider, for example, the correct typography for dashes and quotation marks: –, —, “ ”. Try to type them on your keyboard. The code points are 0x2013, 0x2014, 0x201C and 0x201D. Try to squeeze them into one byte. Good luck!
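
To see concretely why these punctuation marks do not fit in one byte, here is a minimal UTF-8 encoder sketch. The function name encode_utf8 is hypothetical, and the code assumes its input is a valid code point; it performs no error checking.

```cpp
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
// Assumes cp is a valid code point (at most 0x10FFFF, not a surrogate).
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 7 bits: a single byte, identical to ASCII.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 11 bits: lead byte 110xxxxx plus one continuation byte.
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 16 bits: lead byte 1110xxxx plus two continuation bytes.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 21 bits: lead byte 11110xxx plus three continuation bytes.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encode_utf8(0x2014) produces the three bytes E2 80 94, while encode_utf8(0x41) produces the single byte 41.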

See http://unicode.org/, http://unicode.org/faq/utf_bom.html.

Please don't make false statements; understand things for yourself first.

—SA


The standard C functions have always worked for me when working with Chinese (Big5), Japanese, and Korean encoded byte streams. They depend on locale and OS support.

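#include <cstdlib> // MB_CUR_MAX and the conversion functions below
#include <climits> // MB_LEN_MAX
MB_LEN_MAX // Maximum size in bytes of a multibyte character (any supported locale)
MB_CUR_MAX // Maximum size in bytes of a multibyte character in the current locale
mblen()    // Length in bytes of a multibyte character
mbtowc()   // Multibyte character to wide character
wctomb()   // Wide character to multibyte character
mbstowcs() // Multibyte string to wide string
wcstombs() // Wide string to multibyte string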
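
A minimal sketch of how these functions might be used to count the characters (rather than the bytes) in a multibyte string is shown below. The helper name count_multibyte_chars is made up for this example, and the result depends entirely on the locale that is active when it is called.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Count the characters in a null-terminated multibyte string, as
// interpreted by the current C locale. Returns -1 if the string
// contains a byte sequence that is invalid in that locale.
inline long count_multibyte_chars(const char* s) {
    std::mblen(nullptr, 0);  // reset the internal shift state
    std::size_t remaining = std::strlen(s);
    long count = 0;
    while (remaining > 0) {
        int n = std::mblen(s, remaining);  // bytes in the next character
        if (n <= 0) return -1;             // invalid or incomplete sequence
        s += n;
        remaining -= static_cast<std::size_t>(n);
        ++count;
    }
    return count;
}
```

In the default "C" locale this only handles plain ASCII reliably. To process Big5 or another multibyte encoding, the program would first need to select a matching locale with std::setlocale from <clocale>; locale names are platform-specific, so which name to pass is an assumption to check against the target system.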

