C++: How to support surrogate characters in UTF-8
Question
We have an application written with UTF-8 as its base encoding, and it supports the UTF-8 BMP (3 bytes). However, there is a requirement for it to support surrogate pairs.
I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?
If so, what are the steps to make my application use UTF-16 as its default encoding instead of UTF-8?
I don't have a code snippet, as the entire application was written with UTF-8 in mind and not surrogate characters.
What would I need to change throughout the code to either support surrogate pairs in UTF-8 or change the default encoding to UTF-16?
Recommended answer
> We have an application written with UTF-8 as its base encoding, and it supports the UTF-8 BMP (3 bytes).
Why not the entire Unicode repertoire (4 bytes)? Why limit yourself to 3 bytes? Three bytes only get you codepoints up to U+FFFF. Four bytes get you an additional 1,048,576 codepoints, all the way up to U+10FFFF.
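To make the byte counts concrete, here is a minimal sketch of a UTF-8 encoder that covers all four sequence lengths. The function name `append_utf8` is hypothetical and not taken from the asker's code; it only illustrates where the 3-byte limit ends and the 4-byte range begins.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from the original application): appends the UTF-8
// encoding of one code point to `out`. Returns false, leaving `out` unchanged,
// for surrogate code points and for values above U+10FFFF.
bool append_utf8(std::string& out, std::uint32_t cp) {
    if (cp <= 0x7F) {                        // 1 byte: U+0000..U+007F
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                // 2 bytes: U+0080..U+07FF
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {               // 3 bytes: U+0800..U+FFFF (end of the BMP)
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false;                    // surrogates are not valid on their own
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {             // 4 bytes: U+10000..U+10FFFF
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;                        // beyond the Unicode range
    }
    return true;
}
```

An encoder that stops at the third branch is what "BMP only" support amounts to; the fourth branch is the only addition needed on the encoding side.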
> However, there is a requirement for it to support surrogate pairs.
Surrogate pairs only apply to UTF-16, not to UTF-8 or even UCS-2 (the predecessor to UTF-16).
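For reference, this is roughly how UTF-16 represents a codepoint above U+FFFF as two 16-bit units. The helper names below are hypothetical and chosen only for illustration.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; names are not from the original post.

// High (lead) surrogates occupy U+D800..U+DBFF.
constexpr bool is_high_surrogate(std::uint16_t u) {
    return u >= 0xD800 && u <= 0xDBFF;
}

// Low (trail) surrogates occupy U+DC00..U+DFFF.
constexpr bool is_low_surrogate(std::uint16_t u) {
    return u >= 0xDC00 && u <= 0xDFFF;
}

// Combines a valid high/low pair into the code point it represents.
// Each surrogate carries 10 bits; the result is offset by 0x10000.
// The caller is expected to have checked both units first.
constexpr std::uint32_t combine_surrogates(std::uint16_t hi, std::uint16_t lo) {
    return 0x10000u
         + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00u);
}
```

For example, the pair 0xD83D, 0xDE00 combines to U+1F600. UTF-8 encodes that codepoint directly as four bytes, so no pair is involved on the UTF-8 side.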
> I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?
The codepoints used for surrogates (U+D800 through U+DFFF) can be physically encoded in UTF-8; however, they are reserved by the Unicode standard and are illegal to use outside of the UTF-16 encoding. UTF-8 has no need for surrogate pairs, and any decoded Unicode string that contains surrogate codepoints should be considered malformed.
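One way to spot such malformed input is to look for the byte pattern a surrogate produces when pushed through the 3-byte UTF-8 form. The sketch below assumes a hypothetical function name and checks only for this one pattern; it is not a full UTF-8 validator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical check (illustration only): reports whether `s` contains a
// surrogate code point encoded in the 3-byte UTF-8 form. U+D800..U+DFFF
// always encode as 0xED followed by a byte in 0xA0..0xBF, whereas valid
// UTF-8 only allows 0x80..0x9F after a 0xED lead byte.
bool contains_encoded_surrogate(const std::string& s) {
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        const unsigned char next = static_cast<unsigned char>(s[i + 1]);
        if (lead == 0xED && next >= 0xA0 && next <= 0xBF) {
            return true;
        }
    }
    return false;
}
```

A decoder would typically reject such input or substitute U+FFFD; which of the two is appropriate depends on the application.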
> If so, what are the steps to make my application use UTF-16 as its default encoding instead of UTF-8?
We can't answer that, since you have not provided any information about how your project is set up, what compiler you are using, etc.
However, you don't need to switch the application to UTF-16. You just need to update your code to support the 4-byte encoding of UTF-8, and make sure you handle surrogate pairs when converting 16-bit data to UTF-8. Don't limit yourself to U+FFFF as the highest possible codepoint. Unicode has many more codepoints than that.
It sounds like your code only handles UCS-2 when converting data to/from UTF-8. Just update that code to support UTF-16 instead of UCS-2, and you should be fine.
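As a rough sketch of what that update might look like, the function below converts UTF-16 to UTF-8 and combines surrogate pairs before encoding. The name `utf16_to_utf8` is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption made here; your application may prefer to report an error instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical converter (not the asker's code): UTF-16 in, UTF-8 out.
// Assumption: unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: a UCS-2-only converter would stop here.
            // UTF-16 requires combining it with the low surrogate that follows.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                const std::uint32_t lo = in[i + 1];
                cp = 0x10000u + ((cp - 0xD800u) << 10) + (lo - 0xDC00u);
                ++i;  // the low surrogate has been consumed
            } else {
                cp = 0xFFFD;  // high surrogate without a partner
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;      // low surrogate without a preceding high one
        }

        if (cp <= 0x7F) {
            out += static_cast<char>(cp);
        } else if (cp <= 0x7FF) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0xFFFF) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // Only reachable via a combined pair: U+10000..U+10FFFF, 4 bytes.
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The reverse direction (UTF-8 to UTF-16) would need the matching change: decode 4-byte sequences and split any codepoint above U+FFFF into a surrogate pair.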