Is there a C library to convert Unicode code points to UTF-8?

Published: 2021/9/15 19:40:02  Tags: c unicode utf-8

Question

I have to go through some text and write UTF-8 output according to the character patterns. I thought it would be easy if I could work with code points and convert them to UTF-8. I have been reading about Unicode and UTF-8, but couldn't find a good solution. Any help will be appreciated.

Recommended answer

Converting Unicode code points to UTF-8 is so trivial that calling a library probably takes more code than just doing it yourself:

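/* Assumed context: c is an unsigned code point, b is an unsigned char *
   output cursor, and error is a label in the enclosing function. */
if (c<0x80) *b++=c;                                    /* 1 byte: U+0000..U+007F */
else if (c<0x800) *b++=192+c/64, *b++=128+c%64;        /* 2 bytes: U+0080..U+07FF */
else if (c-0xd800u<0x800) goto error;                  /* surrogates U+D800..U+DFFF (relies on unsigned wraparound) */
else if (c<0x10000) *b++=224+c/4096, *b++=128+c/64%64, *b++=128+c%64;   /* 3 bytes */
else if (c<0x110000) *b++=240+c/262144, *b++=128+c/4096%64, *b++=128+c/64%64, *b++=128+c%64;   /* 4 bytes */
else goto error;                                       /* above U+10FFFF */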
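
For anyone who wants the snippet as a callable function, here is a minimal sketch of the same logic. The name `utf8_encode`, the `uint32_t` parameter, and the "return 0 on error" convention are assumptions made for illustration; they are not part of the original answer or of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point c into out, which must have
   room for at least 4 bytes. Returns the number of bytes written, or 0 if
   c is not a Unicode scalar value (a surrogate or above U+10FFFF). */
size_t utf8_encode(uint32_t c, unsigned char *out)
{
    if (c < 0x80) {                          /* 0xxxxxxx */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {                         /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c - 0xD800u < 0x800)                 /* surrogates; unsigned wraparound */
        return 0;
    if (c < 0x10000) {                       /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c < 0x110000) {                      /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;                                /* above U+10FFFF */
}
```

A caller that writes a long string would advance its output pointer by the return value after each call and treat 0 as an error. The shifts and masks here are equivalent to the divisions and remainders in the snippet above.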

Also, doing it yourself means you can tune the API to the kind of work you need (one character at a time, or long strings?). You can remove the error cases if you know your input is a valid Unicode scalar value.

The other direction is a good bit harder to get right. I recommend a finite automaton approach rather than the typical bit-arithmetic loops, which sometimes decode invalid sequences as aliases for real characters (this is very dangerous and can lead to security problems).
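
To make the decoding advice concrete, here is a minimal sketch of a strict decoder. It is not a table-driven automaton and not code from the original answer; it is a small hand-written state machine in the same spirit, where the lead byte selects a state that fixes how many continuation bytes follow and which range the next byte may fall in. The name `utf8_decode` and its "return 0 on error" convention are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one scalar value from s[0..n).
   Returns the number of bytes consumed and stores the result in *cp,
   or returns 0 if the input is empty, truncated, or not well-formed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for the next byte */
    size_t need, i;                      /* continuation bytes expected */
    uint32_t c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                   /* ASCII */
        *cp = s[0];
        return 1;
    } else if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        need = 1; c = s[0] & 0x1F;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        need = 2; c = s[0] & 0x0F;
        if (s[0] == 0xE0) lo = 0xA0;     /* rejects overlong 3-byte forms */
        if (s[0] == 0xED) hi = 0x9F;     /* rejects surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        need = 3; c = s[0] & 0x07;
        if (s[0] == 0xF0) lo = 0x90;     /* rejects overlong 4-byte forms */
        if (s[0] == 0xF4) hi = 0x8F;     /* rejects values above U+10FFFF */
    } else {
        return 0;                        /* 0x80..0xC1 and 0xF5..0xFF never lead */
    }

    if (n < need + 1)
        return 0;                        /* truncated sequence */
    for (i = 1; i <= need; i++) {
        if (s[i] < lo || s[i] > hi)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
        lo = 0x80; hi = 0xBF;            /* only the first continuation is restricted */
    }
    *cp = c;
    return need + 1;
}
```

With this shape, an overlong encoding such as `C0 AF` (an alias for `/`) is rejected at the lead byte, and `ED A0 80` (a surrogate) is rejected at the second byte, which is exactly the class of input that naive bit-arithmetic loops tend to accept.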

Even if you do end up going with a library, I think you should either try writing it yourself first or at least seriously study the UTF-8 specification before going further. A lot of bad design comes from treating UTF-8 as a black box, when the whole point is that it is not a black box: it was designed to have very powerful properties, and too many programmers new to UTF-8 fail to see this until they have worked with it a lot themselves.
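
As one illustration of such a property (chosen here as an example; the answer does not name specific ones), UTF-8 is self-synchronizing: continuation bytes always have the form `10xxxxxx` and lead bytes never do, so character boundaries can be found without decoding anything. The sketch below assumes the input is already valid UTF-8, and both function names are hypothetical.

```c
#include <stddef.h>

/* Count code points in valid UTF-8 by counting every byte that is
   not a continuation byte (10xxxxxx). */
size_t utf8_count(const unsigned char *s, size_t n)
{
    size_t count = 0, i;
    for (i = 0; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}

/* From an arbitrary byte offset i, back up to the first byte of the
   character that contains it. */
size_t utf8_char_start(const unsigned char *s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        i--;
    return i;
}
```

Neither function needs a lookup table or any knowledge of the code points involved, which is the kind of thing that is easy to miss when UTF-8 is treated as an opaque encoding.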
