Can there be 2 different UTF-8 encodings for the same character?
Question
I'm writing an application that needs to transcode its input from UTF-8 to ISO-8859-1 (Latin 1).
Everything works fine, except that I sometimes get strange encodings for some characters with diacritics. For example, the Latin-1 e with two dots, ë (0xEB), usually arrives as the UTF-8 bytes 0xC3 0xAB, but sometimes as 0xC3 0x83 0xC2 0xAB instead.
This has happened a number of times with input from different sources. Given that the first and last bytes match what I expect, could there be an encoding rule that my library doesn't know about?
Recommended answer
$ "\xC3\x83\xC2\xAB"
Ã«
$ use Encode
$ decode 'UTF-8', "\xC3\x83\xC2\xAB"
ë
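No. UTF-8 has exactly one valid encoding per character; what you have is double-encoded UTF-8. Somewhere upstream, the already-encoded bytes 0xC3 0xAB were misread as the two Latin-1 characters Ã (0xC3) and « (0xAB) and then encoded to UTF-8 a second time, which gives 0xC3 0x83 0xC2 0xAB. Encode::Repair is one way to deal with that.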
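For readers not using Perl, the same idea can be sketched in Python using only the built-in codecs. This is a heuristic under stated assumptions, not what Encode::Repair does internally: the function name utf8_to_latin1 and the max_rounds parameter are made up for illustration, and the upstream mistake is assumed to be a Latin-1 misreading (not Windows-1252). Text that legitimately contains a sequence such as "Ã«" would be "repaired" incorrectly, so apply it only to input you know can be affected.

```python
def utf8_to_latin1(data: bytes, max_rounds: int = 3) -> bytes:
    """Transcode UTF-8 bytes to ISO-8859-1, unwrapping accidental
    double (or triple) UTF-8 encoding along the way.

    Hypothetical helper for illustration; max_rounds caps how many
    extra layers of encoding are peeled off.
    """
    # Raises UnicodeDecodeError if the input is not valid UTF-8 at all.
    text = data.decode("utf-8")

    for _ in range(max_rounds):
        try:
            # If every character fits in Latin-1 AND the resulting bytes
            # are themselves valid UTF-8, the text was probably encoded
            # one time too many.
            candidate = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not double-encoded (or no further layers left)
        if candidate == text:
            break  # pure ASCII: nothing to unwrap
        text = candidate

    # Raises UnicodeEncodeError if a character has no Latin-1 equivalent.
    return text.encode("latin-1")


# How the strange byte sequence arises: UTF-8 bytes misread as Latin-1,
# then encoded to UTF-8 again.
double_encoded = "ë".encode("utf-8").decode("latin-1").encode("utf-8")

print(double_encoded.hex(" "))                        # c3 83 c2 ab
print(utf8_to_latin1(b"\xc3\xab").hex())              # eb
print(utf8_to_latin1(b"\xc3\x83\xc2\xab").hex())      # eb
```

Both the correctly encoded and the double-encoded input end up as the single Latin-1 byte 0xEB. Fixing the source that produces the double encoding is still preferable to guessing on the receiving side.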