Is there a C library to convert Unicode code points to UTF-8?

Question

I have to go through some text and write UTF-8 output according to the character patterns. I thought it would be easy if I could work with the code points and have them converted to UTF-8. I have been reading about Unicode and UTF-8, but couldn't find a good solution. Any help will be appreciated.

Recommended answer

Converting Unicode code points to UTF-8 is so trivial that making the call to a library probably takes more code than just doing it yourself:

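/* c: the code point (an unsigned value); b: unsigned char * cursor into the output buffer */
if (c<0x80) *b++=c;                                    /* 1 byte: ASCII */
else if (c<0x800) *b++=192+c/64, *b++=128+c%64;        /* 2 bytes */
else if (c-0xd800u<0x800) goto error;                  /* surrogates U+D800..U+DFFF (unsigned wraparound test) */
else if (c<0x10000) *b++=224+c/4096, *b++=128+c/64%64, *b++=128+c%64;   /* 3 bytes */
else if (c<0x110000) *b++=240+c/262144, *b++=128+c/4096%64, *b++=128+c/64%64, *b++=128+c%64;   /* 4 bytes */
else goto error;                                       /* above U+10FFFF */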

Also, doing it yourself means you can tune the API to the type of work you need (one character at a time, or long strings?). You can remove the error cases if you know your input is a valid Unicode scalar value.
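
As one possible shape for a character-at-a-time API, the snippet above could be wrapped in a function roughly like the following. This is only a sketch: the name utf8_encode, the "return 0 on error" convention, and the requirement that the caller provide at least 4 bytes are assumptions made for illustration, not part of the original answer. Shifts and masks are used in place of the divisions and remainders, which is equivalent for unsigned values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point c as UTF-8 into out, which must
 * have room for at least 4 bytes. Returns the number of bytes written,
 * or 0 if c is not a Unicode scalar value (a surrogate or > U+10FFFF). */
size_t utf8_encode(uint32_t c, unsigned char *out)
{
    assert(out != NULL);
    if (c < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c - 0xD800u < 0x800)             /* surrogates U+D800..U+DFFF */
        return 0;
    if (c < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c < 0x110000) {                  /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;                            /* beyond U+10FFFF */
}
```

For long strings, a caller would typically loop over the input, advance the output pointer by the returned length, and treat a return value of 0 as the error case.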

The other direction is a good bit harder to get correct. I recommend a finite automaton approach rather than the typical bit-arithmetic loops that sometimes decode invalid sequences as aliases for real characters (which is very dangerous and can lead to security problems).
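
To illustrate what a state-machine style decoder might look like, here is a minimal sketch that is fed one byte at a time. It is not the answer author's code: the names utf8_state and utf8_feed and the result constants are hypothetical. The state is the number of continuation bytes still expected plus the range allowed for the next one; narrowing that range after the lead bytes E0, ED, F0 and F4 is what rejects overlong forms, surrogates and values above U+10FFFF instead of decoding them as aliases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state; zero-initialize before the first byte. */
struct utf8_state {
    uint32_t cp;           /* code point bits accumulated so far */
    unsigned remaining;    /* continuation bytes still expected */
    unsigned char lo, hi;  /* allowed range for the next continuation byte */
};

enum { UTF8_ERROR = -1, UTF8_MORE = 0, UTF8_DONE = 1 };

/* Feed one byte. Returns UTF8_DONE when s->cp holds a complete code point,
 * UTF8_MORE when more bytes are needed, UTF8_ERROR on an invalid byte. */
int utf8_feed(struct utf8_state *s, unsigned char byte)
{
    assert(s != NULL);
    if (s->remaining == 0) {             /* start state: expect a lead byte */
        s->lo = 0x80;
        s->hi = 0xBF;
        if (byte < 0x80) {
            s->cp = byte;
            return UTF8_DONE;
        }
        if (byte >= 0xC2 && byte <= 0xDF) {
            s->cp = byte & 0x1F;
            s->remaining = 1;
        } else if (byte >= 0xE0 && byte <= 0xEF) {
            s->cp = byte & 0x0F;
            s->remaining = 2;
            if (byte == 0xE0) s->lo = 0xA0;   /* rejects overlong 3-byte forms */
            if (byte == 0xED) s->hi = 0x9F;   /* rejects surrogates */
        } else if (byte >= 0xF0 && byte <= 0xF4) {
            s->cp = byte & 0x07;
            s->remaining = 3;
            if (byte == 0xF0) s->lo = 0x90;   /* rejects overlong 4-byte forms */
            if (byte == 0xF4) s->hi = 0x8F;   /* rejects values above U+10FFFF */
        } else {
            return UTF8_ERROR;           /* 0x80..0xC1 or 0xF5..0xFF */
        }
        return UTF8_MORE;
    }
    if (byte < s->lo || byte > s->hi) {  /* not a valid continuation here */
        s->remaining = 0;
        return UTF8_ERROR;
    }
    s->cp = (s->cp << 6) | (byte & 0x3F);
    s->lo = 0x80;                        /* later continuations use the full range */
    s->hi = 0xBF;
    return --s->remaining == 0 ? UTF8_DONE : UTF8_MORE;
}
```

Note that in this sketch the byte that triggers an error is consumed; a caller that wants to resynchronize may need to feed that byte again from the start state, since it could itself be a valid lead byte.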

Even if you do end up going with a library, I think you should either try writing it yourself first or at least seriously study the UTF-8 specification before going further. A lot of bad design can come from treating UTF-8 as a black box when the whole point is that it's not a black box but was created to have very powerful properties, and too many programmers new to UTF-8 fail to see this until they've worked with it a lot themselves.
