UTF8 processing in C


Question

I have basic understanding of UTF8: code points have variable length, so a "character" can be 8 bits, 16 bits, or even longer.

What I'm wondering is whether there is some sample code, library, etc. in C that does for a UTF-8 string what the C standard library does for ordinary strings, e.g. tell the length of the string.

Thanks.

Recommended answer

GNU does have a Unicode string library, called libunistring, but it doesn’t handle anything nearly as well as ICU’s does.

For example, the GNU library doesn’t even give you access to collation, which is the basis for all string comparison. In contrast, ICU does. Another thing that ICU has and GNU doesn’t appear to have is Unicode regexes. For that, you might like to use Philip Hazel’s excellent PCRE library for C, which can be compiled with UTF-8 support.

However, it might be that the GNU library is enough for what you need. I don’t like its API much. Very messy. If you like C programming, you might try the Go programming language, which has excellent Unicode support. It’s a new language, but small and clean and fun to use.
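
For the narrow case the question mentions, the length of a string, a code point count needs no library at all. The following is a minimal sketch, not taken from libunistring or ICU: the function name utf8_codepoint_count is made up for illustration, and the code assumes the input is already valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper (not part of any library): count the code points
   in a NUL-terminated UTF-8 string.
   Every code point is encoded as one lead byte followed by zero or more
   continuation bytes of the form 10xxxxxx, so counting the bytes that
   are NOT continuation bytes gives the number of code points.
   Assumption: the input is valid UTF-8; nothing is validated here. */
size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;
    const unsigned char *p;

    for (p = (const unsigned char *)s; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80)   /* skip continuation bytes */
            ++count;
    }
    return count;
}
```

Calling utf8_codepoint_count("h\xC3\xA9llo") yields 5 where strlen reports 6 bytes. Note that this counts code points, not user-perceived characters: a base letter followed by a combining mark still counts as two.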

On the other hand, the major interpreted languages — Perl, Python, and Ruby — all have varying support for Unicode that is better than you’ll ever get in C. Of those, Perl’s Unicode support is the most developed and robust.

Remember: it isn’t enough to support more characters. Without the rules that go with them, you don’t have Unicode. At most, you might have ISO 10646: a large character repertoire but no rules. My mantra is "Unicode isn’t just more characters; it’s more characters plus a whole bunch of rules for handling them."
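
One small illustration of why the rules matter, offered as a sketch rather than as anything from the libraries above: the same user-perceived character can have more than one encoding, and plain byte comparison treats them as different strings. The names E_PRECOMPOSED, E_DECOMPOSED, and bytes_equal are made up for this example.

```c
#include <string.h>

/* Two valid UTF-8 spellings of the same user-perceived character. */
const char E_PRECOMPOSED[] = "\xC3\xA9";   /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */
const char E_DECOMPOSED[]  = "e\xCC\x81";  /* U+0065 followed by U+0301 COMBINING ACUTE ACCENT */

/* Hypothetical helper: returns nonzero when the two strings are
   byte-for-byte identical, which is all that strcmp can tell you. */
int bytes_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here bytes_equal(E_PRECOMPOSED, E_DECOMPOSED) returns 0, and the two strings differ in byte length (2 versus 3), even though Unicode defines them as canonically equivalent. Treating them as equal requires normalization, which is one of the rules a real Unicode library implements and a bare character repertoire does not.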
