How to handle Unicode character sequences in C/C++?


Question

What are the most portable and clean ways to handle Unicode character sequences in C and C++?

Moreover, how do I:

- Read Unicode strings

- Convert Unicode strings to ASCII to save some bytes (if the user only inputs ASCII)

- Print Unicode strings

Should I use the environment too? I've read about LC_CTYPE, for example; should I care about it as a developer?

Solution

What are the most portable and clean ways to handle Unicode character sequences in C and C++?

Have all strings in your program be UTF-8, UTF-16, or UTF-32. If for some reason you need to work with a non-Unicode encoding, do the conversion on input and output.
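As a rough illustration of converting at the input boundary, here is a minimal sketch that turns ISO-8859-1 bytes into UTF-8. The function name `latin1_to_utf8` is hypothetical, and the sketch assumes the input really is ISO-8859-1. It does not handle windows-1252, which assigns different characters to bytes 0x80 to 0x9F.

```cpp
#include <string>

// Convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Every Latin-1 byte value equals its Unicode code point (U+0000..U+00FF),
// so each input byte becomes either one or two UTF-8 bytes.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte expands to two
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII range: identical in Latin-1 and UTF-8
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF: two-byte sequence 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

The reverse conversion on output would only be lossless for text that stays within U+0000 to U+00FF, which is one reason to keep the internal representation in a UTF encoding.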

Read Unicode strings

Same way you'd read an ASCII file. But there's still a lot of non-Unicode data around, so you'll want to check whether the data is Unicode. If it's not (or if it's UTF-8 when your preferred internal encoding is UTF-32), you'll need to convert it.

  • UTF-8 and UTF-32 can be reliably detected by validation.
  • UTF-16 can be detected by the presence of a BOM.
  • If it's not a UTF encoding, it's likely in ISO-8859-1 or windows-1252.
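Detection by validation and conversion to UTF-32 can be done in one pass. The following is a sketch under stated assumptions, not a complete decoder: the name `utf8_to_utf32` is hypothetical, a leading BOM is not treated specially, and the function stops at the first malformed sequence instead of substituting a replacement character.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32 code points.
// Returns false (with `out` holding only the code points decoded so far) on the
// first malformed sequence: invalid lead byte, truncated or bad continuation
// bytes, overlong form, surrogate, or a value above U+10FFFF.
bool utf8_to_utf32(const std::string& in, std::u32string& out) {
    out.clear();
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes expected after the lead byte
        char32_t min_cp = 0;    // smallest code point allowed for this length
        if (b0 < 0x80)                { cp = b0;        extra = 0; min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra) return false;  // sequence cut off at end of input
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

If this returns false for non-empty input, the data is probably in a legacy encoding such as ISO-8859-1 or windows-1252, because text in those encodings that contains non-ASCII bytes rarely happens to form valid UTF-8.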

Convert Unicode strings to ASCII to save some bytes (if the user only inputs ASCII)

Don't. If your data is all ASCII, then UTF-8 will take exactly the same amount of space. And if it isn't, you'll lose information when you convert to ASCII. If you care about saving bytes:

  • Choose the optimal UTF encoding. For characters U+0000 to U+007F, UTF-8 is the smallest. For characters U+0800 to U+FFFF, UTF-16 is the smallest.
  • Use data compression like gzip. There is a SCSU encoding specifically designed for Unicode, but I don't know how good it is.
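To compare the encodings for a given text, you can count the bytes each one would need. This is a small sketch with hypothetical names (`utf8_length`, `utf16_length`, `encoded_sizes`), and it assumes the input holds valid Unicode scalar values.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in UTF-8.
std::size_t utf8_length(char32_t cp) {
    if (cp < 0x80) return 1;     // U+0000..U+007F
    if (cp < 0x800) return 2;    // U+0080..U+07FF
    if (cp < 0x10000) return 3;  // U+0800..U+FFFF
    return 4;                    // U+10000..U+10FFFF
}

// Bytes needed to encode one code point in UTF-16
// (one 16-bit unit, or a surrogate pair above U+FFFF).
std::size_t utf16_length(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Total size in bytes of `text` in each UTF encoding (without a BOM).
EncodedSizes encoded_sizes(const std::u32string& text) {
    EncodedSizes s{0, 0, 0};
    for (char32_t cp : text) {
        s.utf8 += utf8_length(cp);
        s.utf16 += utf16_length(cp);
        s.utf32 += 4;  // UTF-32 always uses four bytes per code point
    }
    return s;
}
```

For U+0080 to U+07FF, UTF-8 and UTF-16 tie at two bytes per character, and above U+FFFF all three encodings need four bytes.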

Print Unicode strings

Writing UTF-8 is no different from writing ASCII.

Except at the Windows command prompt, because it still uses the old "OEM" code pages. There you can use WriteConsoleW with UTF-16 strings.
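A sketch of how both cases might be handled in one helper, assuming the program keeps its strings in UTF-8. The name `print_utf8` is hypothetical. On Windows it converts to UTF-16 with `MultiByteToWideChar` and calls `WriteConsoleW` when standard output is a real console, using `GetConsoleMode` to tell a console from a redirected stream. In every other case it writes the bytes unchanged.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <vector>
#include <windows.h>
#endif

// Write UTF-8 text to standard output. Returns true on success.
bool print_utf8(const std::string& text) {
#ifdef _WIN32
    HANDLE console = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    // GetConsoleMode succeeds only for a real console, not a file or pipe.
    if (console != INVALID_HANDLE_VALUE && GetConsoleMode(console, &mode)) {
        if (text.empty()) return true;
        const int len = static_cast<int>(text.size());
        // First call asks how many UTF-16 code units are needed.
        const int wide_len =
            MultiByteToWideChar(CP_UTF8, 0, text.data(), len, nullptr, 0);
        if (wide_len <= 0) return false;
        std::vector<wchar_t> wide(static_cast<std::size_t>(wide_len));
        MultiByteToWideChar(CP_UTF8, 0, text.data(), len, wide.data(), wide_len);
        DWORD written = 0;
        return WriteConsoleW(console, wide.data(),
                             static_cast<DWORD>(wide_len), &written, nullptr) != 0;
    }
#endif
    // Files, pipes, and non-Windows terminals: UTF-8 bytes pass through as-is.
    return std::fwrite(text.data(), 1, text.size(), stdout) == text.size();
}
```

The redirect check matters because `WriteConsoleW` fails when output goes to a file or pipe, where writing the raw UTF-8 bytes is the better choice anyway.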

Should I use the environment too? I've read about LC_CTYPE, for example; should I care about it as a developer?

LC_CTYPE is a holdover from the days when every language had its own character encoding, and thus its own ctype.h functions. Today, the Unicode Character Database takes care of that. The beauty of Unicode is that it separates character encoding handling from locale handling (except for the special uppercase/lowercase rules for Lithuanian, Turkish, and Azeri).

But each language still has its own collation rules and number formatting rules, so you'll still need locales for those. And you'll need to set your locale's character encoding to UTF-8.
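One possible way to set this up at program start is sketched below. The helper names `locale_name_is_utf8` and `init_utf8_locale` are hypothetical, and the fallback name `"C.UTF-8"` is an assumption: locale names are platform-specific, and that one is not available everywhere. The format of the string returned by `std::setlocale` is also implementation-defined, so the name check is a heuristic.

```cpp
#include <clocale>
#include <string>

// Heuristic: does a locale name such as "en_US.UTF-8" or "C.utf8"
// mention UTF-8? (Locale name formats are implementation-defined.)
bool locale_name_is_utf8(const char* name) {
    if (name == nullptr) return false;
    std::string lower;
    for (const char* p = name; *p != '\0'; ++p) {
        char c = *p;
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
        lower.push_back(c);
    }
    return lower.find("utf-8") != std::string::npos ||
           lower.find("utf8") != std::string::npos;
}

// Adopt the user's locale for collation and number formatting, then make
// sure the character encoding (LC_CTYPE) is UTF-8. Returns true on success.
bool init_utf8_locale() {
    std::setlocale(LC_ALL, "");  // take every category from the environment
    if (locale_name_is_utf8(std::setlocale(LC_CTYPE, nullptr))) {
        return true;             // the environment already selects UTF-8
    }
    // Assumed fallback name; changes only LC_CTYPE so that collation and
    // number formatting keep following the user's locale.
    return std::setlocale(LC_CTYPE, "C.UTF-8") != nullptr;
}
```

Calling `init_utf8_locale()` once near the top of `main` and checking its result lets the program warn the user when no UTF-8 locale is available.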
