C++: How to support surrogate characters in UTF-8

Views: 109 Published: 2020/7/13 5:07:26 Tags: c++ utf-8 internationalization utf-16 surrogate-pairs


Question

We have an application written with UTF-8 as its base encoding, and it supports the BMP in UTF-8 (up to 3 bytes per character). However, there is a requirement for it to support surrogate pairs.

I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?

If yes, what are the steps to make my application use UTF-16 as its default encoding rather than UTF-8?

I don't have a code snippet, as the entire application was written with UTF-8 in mind and not surrogate characters.

What would I need to change throughout the code to either support surrogate pairs in UTF-8 or change the default encoding to UTF-16?

Recommended answer

"We have an application written with UTF-8 as its base encoding, and it supports the BMP in UTF-8 (up to 3 bytes per character)."

Why not the entire Unicode repertoire (4 bytes)? Why limit yourself to only 3 bytes? 3 bytes gets you support for code points only up to U+FFFF. 4 bytes gets you support for an additional 1,048,576 code points, all the way up to U+10FFFF.
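To make those byte counts concrete, here is a minimal sketch of a UTF-8 encoder that covers all four sequence lengths. The function name `encode_utf8` and the choice to return an empty string for invalid input are illustrative assumptions, not something taken from the application in question.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Assumption for this sketch: invalid input (a surrogate code point or a
// value above U+10FFFF) yields an empty string.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                       // 1 byte: U+0000..U+007F
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: U+0080..U+07FF
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes: rest of the BMP
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;                     // surrogates are not encodable
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes: U+10000..U+10FFFF
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, U+FFFF is the last code point that fits in 3 bytes, and everything from U+10000 through U+10FFFF takes 4 bytes.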

"However, there is a requirement for it to support surrogate pairs."

Surrogate pairs only apply to UTF-16, not to UTF-8 or even UCS-2 (the predecessor to UTF-16).
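For reference, this is roughly how UTF-16 represents a code point above U+FFFF as a surrogate pair. It is a sketch only: the helper names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical, and neither function validates its input.

```cpp
#include <utility>

// Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
// surrogate pair {high, low}. The caller must pass a value in that range.
std::pair<char16_t, char16_t> to_surrogate_pair(char32_t cp) {
    const char32_t v = cp - 0x10000;  // 20-bit value
    return { static_cast<char16_t>(0xD800 + (v >> 10)),     // top 10 bits
             static_cast<char16_t>(0xDC00 + (v & 0x3FF)) }; // low 10 bits
}

// Combine a high (U+D800..U+DBFF) and low (U+DC00..U+DFFF) surrogate
// back into the code point they represent.
char32_t from_surrogate_pair(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, U+1F600 becomes the pair 0xD83D, 0xDE00 in UTF-16, while in UTF-8 the same code point is simply a single 4-byte sequence.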

"I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?"

The code points reserved for surrogates (U+D800 through U+DFFF) can be physically encoded in UTF-8; however, they are reserved by the Unicode standard and are illegal to use outside of UTF-16 encoding. UTF-8 has no need for surrogate pairs, and any decoded Unicode string that contains surrogate code points should be considered malformed.
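One way to spot such malformed input is to look for the byte pattern that a surrogate code point would produce if it were pushed through the 3-byte UTF-8 form: a lead byte of 0xED followed by a byte in the range 0xA0..0xBF. The sketch below assumes a hypothetical helper name, `contains_encoded_surrogate`, and only checks for this one problem; it is not a full UTF-8 validator.

```cpp
#include <cstddef>
#include <string>

// Return true if the byte string contains a 3-byte sequence that encodes a
// surrogate code point (U+D800..U+DFFF), i.e. 0xED followed by 0xA0..0xBF.
// In UTF-8, 0xED only ever appears as a lead byte, so a simple scan suffices.
bool contains_encoded_surrogate(const std::string& bytes) {
    for (std::size_t i = 0; i + 1 < bytes.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        const unsigned char next = static_cast<unsigned char>(bytes[i + 1]);
        if (lead == 0xED && next >= 0xA0 && next <= 0xBF) {
            return true;
        }
    }
    return false;
}
```

Data that trips this check was typically produced by code that treated UTF-16 code units as if they were code points and encoded each half of a pair separately.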

"If yes, what are the steps to make my application use UTF-16 as its default encoding rather than UTF-8?"

We can't answer that, since you have not provided any information about how your project is set up, what compiler you are using, etc.

However, you don't need to switch the application to UTF-16. You just need to update your code to support the 4-byte encoding of UTF-8, and make sure you support surrogate pairs when converting 16-bit data to UTF-8. Don't limit yourself to U+FFFF as the highest possible code point. Unicode has many more code points than that.

It sounds like your code only handles UCS-2 when converting data to/from UTF-8. Just update that code to support UTF-16 instead of UCS-2, and you should be fine.
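As a rough illustration of that change, the sketch below converts UTF-16 to UTF-8 and combines surrogate pairs into a single code point before encoding, which is the step a UCS-2-only converter lacks. The function name `utf16_to_utf8` is hypothetical, and replacing unpaired surrogates with U+FFFD is a policy assumed for this example; your application may prefer to report an error instead.

```cpp
#include <cstddef>
#include <string>

// Convert UTF-16 code units to UTF-8 bytes.
// Assumption for this sketch: an unpaired surrogate becomes U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid pair: combine both units into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp <= 0x7F) {
            out += static_cast<char>(cp);
        } else if (cp <= 0x7FF) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0xFFFF) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // 4-byte form, reached only via a combined surrogate pair.
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The reverse direction follows the same idea: decode each UTF-8 sequence to a code point, and emit a surrogate pair whenever the code point is above U+FFFF.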
