C++: How to support surrogate characters in UTF-8


Question

We have an application that is written with UTF-8 as its base encoding, and it supports the UTF-8 BMP (3 bytes). However, there is a requirement for it to support surrogate pairs.

I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?

If so, what are the steps to make my application use UTF-16 as its default encoding rather than UTF-8?

I don't have a code snippet, because the entire application was written with UTF-8 in mind, not surrogate characters.

What would I need to change throughout the code, either to support surrogate pairs in UTF-8 or to change the default encoding to UTF-16?

Recommended answer

> We have an application that is written with UTF-8 as its base encoding, and it supports the UTF-8 BMP (3 bytes).

Why not the entire Unicode repertoire (4 bytes)? Why limit yourself to only 3 bytes? Three bytes get you support for codepoints only up to U+FFFF. Four bytes get you support for an additional 1,048,576 codepoints, all the way up to U+10FFFF.
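To make the byte counts concrete, here is a minimal sketch of a UTF-8 encoder that covers the 4-byte range. The function name `encode_utf8` is hypothetical and not taken from the asker's code; treat it as an illustration of the bit layout rather than a drop-in replacement.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// 1 byte up to U+007F, 2 up to U+07FF, 3 up to U+FFFF, 4 up to U+10FFFF.
std::string encode_utf8(char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF)
        throw std::invalid_argument("surrogate code point");
    if (cp > 0x10FFFF)
        throw std::invalid_argument("code point beyond U+10FFFF");

    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // The 4-byte form that a BMP-only implementation is missing.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `encode_utf8(0x20AC)` yields the three bytes E2 82 AC, while `encode_utf8(0x1F600)` yields the four bytes F0 9F 98 80.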

> However, there is a requirement for it to support surrogate pairs.

Surrogate pairs apply only to UTF-16, not to UTF-8 or even UCS-2 (the predecessor to UTF-16).
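For reference, the arithmetic that UTF-16 uses for surrogate pairs can be sketched as follows. The helper names `to_surrogates` and `from_surrogates` are hypothetical, and both assume the caller has already checked that the inputs are in range.

```cpp
#include <utility>

// Hypothetical helper: split a supplementary code point
// (U+10000..U+10FFFF) into a UTF-16 high/low surrogate pair.
// Precondition: 0x10000 <= cp <= 0x10FFFF.
std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
    cp -= 0x10000;  // leaves a 20-bit value
    return {static_cast<char16_t>(0xD800 + (cp >> 10)),     // top 10 bits
            static_cast<char16_t>(0xDC00 + (cp & 0x3FF))};  // low 10 bits
}

// Hypothetical helper: combine a high/low surrogate pair into a code point.
// Precondition: hi is in D800..DBFF and lo is in DC00..DFFF.
char32_t from_surrogates(char16_t hi, char16_t lo) {
    return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10) +
           (static_cast<char32_t>(lo) - 0xDC00);
}
```

For example, U+1F600 splits into the pair D83D DE00, and combining that pair gives U+1F600 back.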

> I have read somewhere that surrogate characters are not supported in UTF-8. Is that true?

The codepoints that are used for encoding surrogates (U+D800 through U+DFFF) can be physically encoded in UTF-8, but they are reserved by the Unicode standard and are illegal to use outside of the UTF-16 encoding. UTF-8 has no need for surrogate pairs, and any decoded Unicode string that contains surrogate codepoints should be considered malformed.
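One way to enforce this is to reject surrogate codepoints while validating UTF-8 input. The following is a minimal sketch; `is_valid_utf8` is a hypothetical name, and the extra checks for overlong forms and values above U+10FFFF are additions that a strict decoder would typically also make.

```cpp
#include <cstddef>
#include <string>

// Hypothetical validator: accepts 1- to 4-byte UTF-8 sequences and
// rejects encoded surrogates, overlong forms, and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        char32_t min;  // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;            // stray continuation or invalid lead byte

        if (i + len > n) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                       // overlong encoding
        if (cp > 0x10FFFF) return false;                  // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // encoded surrogate
        i += len;
    }
    return true;
}
```

With this check, the 4-byte sequence F0 9F 98 80 (U+1F600) is accepted, while ED A0 BD, which is the surrogate U+D83D pushed through the 3-byte form, is rejected.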

> If so, what are the steps to make my application use UTF-16 as its default encoding rather than UTF-8?

We can't answer that, since you have not provided any information about how your project is set up, what compiler you are using, etc.

However, you don't need to switch the application to UTF-16. You just need to update your code to support the 4-byte encoding of UTF-8, and make sure you handle surrogate pairs when converting 16-bit data to UTF-8. Don't limit yourself to U+FFFF as the highest possible codepoint. Unicode has many more codepoints than that.

It sounds like your code only handles UCS-2 when converting data to and from UTF-8. Just update that code to support UTF-16 instead of UCS-2, and you should be fine.
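As a rough sketch of what such an update might look like for the UTF-16 to UTF-8 direction, the conversion below combines surrogate pairs into a single codepoint before encoding it. The names `append_utf8` and `utf16_to_utf8` are hypothetical, and replacing an unpaired surrogate with U+FFFD is an assumption made here; real code might prefer to report an error instead.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append one code point (not a surrogate,
// at most U+10FFFF) to a UTF-8 string.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical converter: UTF-16 in, UTF-8 out, with surrogate pairs
// decoded to one supplementary code point (4 bytes in UTF-8).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid pair: combine both units and skip the low surrogate.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // assumption: unpaired surrogate becomes U+FFFD
        }
        append_utf8(out, cp);
    }
    return out;
}
```

A UCS-2-only converter would encode each half of a pair separately as a 3-byte sequence, which produces the malformed UTF-8 described earlier; combining the pair first is the essential change.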
