How to handle Unicode character sequences in C/C++?


Question

What are the more portable and clean ways to handle Unicode character sequences in C and C++?

Moreover, how do I:

- Read Unicode strings
- Convert Unicode strings to ASCII to save some bytes (if the user only inputs ASCII)
- Print Unicode strings

Should I use the environment too? I've read about LC_CTYPE, for example. Should I care about it as a developer?

Recommended answer


What are the more portable and clean ways to handle Unicode character sequences in C and C++?

Have all strings in your program be UTF-8, UTF-16, or UTF-32. If for some reason you need to work with a non-Unicode encoding, do the conversion on input and output.
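
To make "do the conversion on input" concrete, here is a minimal sketch that re-encodes ISO-8859-1 bytes as UTF-8 at the program boundary. The function name `latin1_to_utf8` is made up for illustration, and the sketch assumes the input really is ISO-8859-1 (Windows-1252 differs in the 0x80 to 0x9F range and would need a lookup table).

```cpp
#include <string>

// Hypothetical helper: re-encode ISO-8859-1 bytes as UTF-8.
// Every Latin-1 byte value equals its Unicode code point (U+0000..U+00FF),
// so bytes below 0x80 pass through and the rest become two-byte sequences.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two bytes
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));                  // ASCII, unchanged
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte
        }
    }
    return out;
}
```

After this step the rest of the program only ever sees UTF-8, which is the point of converting at the edges.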


Read Unicode strings

Same way you'd read an ASCII file. But there's still a lot of non-Unicode data around, so you'll want to check whether the data is Unicode. If it's not (or if it's UTF-8 when your preferred internal encoding is UTF-32), you'll need to convert it.



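  • UTF-8 and UTF-32 can be detected by the presence of a BOM (when one is present; many UTF-8 files have none).

  • If it's not a UTF encoding, it's probably in ISO-8859-1 or Windows-1252.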
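
The checks above could be sketched roughly as follows. `detect_bom` looks for a byte order mark. `is_valid_utf8` is an extra check that the answer does not spell out: since UTF-8 data often carries no BOM, validating the bytes is one way to decide whether to fall back to ISO-8859-1 or Windows-1252. Both function names and the `Encoding` enum are hypothetical.

```cpp
#include <cstddef>
#include <string>

enum class Encoding { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE, Unknown };

// Look for a byte order mark at the start of the buffer.
// UTF-32LE is tested before UTF-16LE because FF FE 00 00 starts with FF FE.
Encoding detect_bom(const std::string& s) {
    auto starts = [&](const char* bom, std::size_t n) {
        return s.size() >= n && s.compare(0, n, bom, n) == 0;
    };
    if (starts("\xFF\xFE\x00\x00", 4)) return Encoding::Utf32LE;
    if (starts("\x00\x00\xFE\xFF", 4)) return Encoding::Utf32BE;
    if (starts("\xEF\xBB\xBF", 3))     return Encoding::Utf8;
    if (starts("\xFF\xFE", 2))         return Encoding::Utf16LE;
    if (starts("\xFE\xFF", 2))         return Encoding::Utf16BE;
    return Encoding::Unknown;
}

// Structural UTF-8 check: rejects bad lead or continuation bytes, truncated
// sequences, overlong forms, surrogates, and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp, min;
        if (c < 0x80)                { len = 1; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;              // stray continuation or invalid lead byte
        if (n - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}
```

A plausible order is: check for a BOM first, then validate as UTF-8, and only then treat the data as a legacy single-byte encoding. Note that pure ASCII passes the UTF-8 check, which is harmless because ASCII is a subset of UTF-8.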


Convert Unicode strings to ASCII to save some bytes (if the user only inputs ASCII)



Don't. If your data is all ASCII, then UTF-8 will take exactly the same amount of space. And if it isn't, you'll lose information when you convert to ASCII. If you care about saving bytes:


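  • Choose the optimal UTF encoding. For characters U+0000 to U+007F, UTF-8 is the smallest. For characters U+0800 to U+FFFF, UTF-16 is the smallest.

  • Use data compression, such as gzip. There's also an SCSU encoding designed specifically for Unicode, but I don't know how good it is.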
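
To see the size trade-off in numbers, the following sketch counts how many bytes a sequence of code points needs in each UTF. The names `utf8_size`, `utf16_size`, `Sizes`, and `encoded_sizes` are invented for this example, and the input is assumed to hold valid code points.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in UTF-8.
std::size_t utf8_size(char32_t cp) {
    if (cp < 0x80)    return 1;  // U+0000..U+007F
    if (cp < 0x800)   return 2;  // U+0080..U+07FF
    if (cp < 0x10000) return 3;  // U+0800..U+FFFF
    return 4;                    // U+10000..U+10FFFF
}

// Bytes needed in UTF-16: one 16-bit unit in the BMP, a surrogate pair above it.
std::size_t utf16_size(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

struct Sizes { std::size_t utf8, utf16, utf32; };

// Total encoded size of a code point sequence in each UTF (without a BOM).
Sizes encoded_sizes(const std::u32string& text) {
    Sizes s{0, 0, 0};
    for (char32_t cp : text) {
        s.utf8  += utf8_size(cp);
        s.utf16 += utf16_size(cp);
        s.utf32 += 4;
    }
    return s;
}
```

For `U"hello"` this gives 5, 10, and 20 bytes. For three CJK characters such as `U"\u65e5\u672c\u8a9e"` it gives 9, 6, and 12, which is the case where UTF-16 wins.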


Print Unicode strings

Writing UTF-8 is no different from writing ASCII.

Except at the Windows command prompt, because it still uses the old "OEM" code pages. There you can use WriteConsoleW with UTF-16 strings.
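
A rough sketch of that split might look like this. `utf8_to_utf16` and `print_utf8` are hypothetical helpers, the converter assumes its input is already valid UTF-8, and the Windows branch assumes standard output is an actual console (`WriteConsoleW` does not work on a redirected handle). In real Windows code, `MultiByteToWideChar` with `CP_UTF8` is another way to do the conversion.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: decode UTF-8 (assumed valid) into UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (c < 0x80)      { cp = c;        len = 1; }
        else if (c < 0xE0) { cp = c & 0x1F; len = 2; }
        else if (c < 0xF0) { cp = c & 0x0F; len = 3; }
        else               { cp = c & 0x07; len = 4; }
        // Fold in the continuation bytes, six payload bits each.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper: print UTF-8 text to the terminal.
void print_utf8(const std::string& text) {
#ifdef _WIN32
    std::u16string u = utf8_to_utf16(text);
    std::wstring w(u.begin(), u.end());  // wchar_t is 16 bits wide on Windows
    DWORD written = 0;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), w.data(),
                  static_cast<DWORD>(w.size()), &written, nullptr);
#else
    // Elsewhere, UTF-8 bytes are written exactly like ASCII bytes.
    std::fwrite(text.data(), 1, text.size(), stdout);
#endif
}
```

Keeping the platform check inside one function means the rest of the program can stay UTF-8 throughout.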


Should I use the environment too? I've read about LC_CTYPE, for example. Should I care about it as a developer?

LC_CTYPE is a holdover from the days when every language had its own character encoding, and thus its own ctype.h functions. Today, the Unicode Character Database takes care of that. The beauty of Unicode is that it separates character encoding handling from locale handling (except for the special uppercase/lowercase rules for Lithuanian, Turkish, and Azeri).

But each language still has its own collation rules and number formatting rules, so you'll still need locales for those. And you'll need to set your locale's character encoding to UTF-8.
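
As a small illustration of where the locale still matters, the sketch below selects a locale with `std::setlocale` and compares strings with `std::strcoll`, which follows the collation rules of the current locale. The helper names `use_locale` and `sorts_before` are invented here, and locale names other than `"C"` and `""` (for example `"en_US.UTF-8"`) are platform-dependent and may not be installed.

```cpp
#include <clocale>
#include <cstring>
#include <string>

// Hypothetical helper: switch the program's locale and return the name the
// C library actually selected, or an empty string if the request failed.
// Passing "" asks for the locale configured in the user's environment.
std::string use_locale(const char* name) {
    const char* chosen = std::setlocale(LC_ALL, name);
    return chosen ? std::string(chosen) : std::string();
}

// Hypothetical helper: compare two strings by the current locale's collation.
// In the "C" locale this is plain byte order; other locales may differ.
bool sorts_before(const std::string& a, const std::string& b) {
    return std::strcoll(a.c_str(), b.c_str()) < 0;
}
```

A typical program would call `use_locale("")` once at startup and check that the returned name ends in a UTF-8 codeset. In the `"C"` locale, `sorts_before("B", "a")` is true because uppercase letters have smaller byte values, which is rarely the order a user expects.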
