Can I get a single canonical UTF-8 string from a Unicode string?

Question

I have a twelve-year-old Windows program. As may be obvious to the knowledgeable, it was designed for ASCII characters, not Unicode. Most of it has been converted, but there's one spot that still needs to be changed over. There is a serious constraint on it though: the exact same ASCII byte sequence MUST be created by different encoders, some of which will be operating on non-Windows systems.

I'm trying to determine whether UTF-8 will do the trick or not. I've heard in passing that different UTF-8 sequences can come up with the same Unicode string, which would be a problem here.

So the question is: given a Unicode string, can I expect a single canonical UTF-8 sequence to be generated by any standards-conforming implementation of a converter? Or are there multiple possibilities?

Recommended answer

Any given Unicode string, taken as a sequence of code points, has exactly one valid representation in UTF-8.

I think the confusion here is that there are multiple ways in Unicode to get the same visual output for some languages. Not to mention that Unicode has several characters that have no visual representation.
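To make that concrete, here is a minimal Python sketch. The letter "é" is used as an assumed example: it can be written as one precomposed code point or as "e" followed by a combining accent. These are two different Unicode strings, so they encode to different bytes. The helper `canonical_utf8` is a hypothetical name, and normalizing to NFC before encoding is one possible way to make every encoder agree, not something the original answer prescribes.

```python
import unicodedata

# Two different code point sequences that both render as "é".
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Different Unicode strings, therefore different UTF-8 byte sequences.
print(precomposed.encode("utf-8"))  # b'\xc3\xa9'
print(decomposed.encode("utf-8"))   # b'e\xcc\x81'


def canonical_utf8(text: str) -> bytes:
    """Hypothetical helper: normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


# After normalization both spellings produce the same bytes.
print(canonical_utf8(precomposed) == canonical_utf8(decomposed))  # True
```

If every encoder applies the same normalization form before encoding, visually identical input of this kind should yield identical bytes, assuming the implementations use compatible versions of the Unicode data.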

But this has nothing to do with UTF-8; it's a property of Unicode itself. Encoding a given Unicode string as UTF-8 is a purely mechanical process, and it's perfectly reversible.
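The mechanical, reversible nature of the encoding can be checked with a short Python sketch. The sample string and the helper name `is_valid_utf8` are assumptions chosen for illustration. The last check shows that a strict decoder rejects an "overlong" byte sequence, which is why a conforming encoder has only one possible output per code point.

```python
# One code point each from the 1-, 2-, 3- and 4-byte ranges of UTF-8.
text = "A\u00e9\u20ac\U0001f600"

encoded = text.encode("utf-8")
print(encoded)  # b'A\xc3\xa9\xe2\x82\xac\xf0\x9f\x98\x80'

# Decoding gives back exactly the original string.
print(encoded.decode("utf-8") == text)  # True


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# b"\xc1\x81" is an overlong two-byte spelling of "A" (U+0041).
# It is not valid UTF-8, so a strict decoder refuses it.
print(is_valid_utf8(b"A"))         # True
print(is_valid_utf8(b"\xc1\x81"))  # False
```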

The conversion rules are here: http://en.wikipedia.org/wiki/UTF-8
