Can there be 2 different UTF-8 encodings for the same character?


Question

I'm writing an application that needs to transcode its input from UTF-8 to ISO-8859-1 (Latin-1).

Everything works fine, except that I sometimes get strange encodings for some characters with a diaeresis. For example, the Latin-1 e with two dots (ë, 0xEB) usually arrives as the UTF-8 bytes 0xC3 0xAB, but sometimes as 0xC3 0x83 0xC2 0xAB.

This has happened a number of times with input from different sources. Noting that the first and last bytes match what I expect, could there be an encoding rule that my library doesn't know about?

Recommended answer

$ "\xC3\x83\xC2\xAB"
ë
$ use Encode

$ decode 'UTF-8', "\xC3\x83\xC2\xAB"
ë

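You have double-encoded UTF-8. The correct UTF-8 bytes 0xC3 0xAB were at some point misread as the two Latin-1 characters Ã and «, and those characters were then encoded to UTF-8 again, giving 0xC3 0x83 0xC2 0xAB. This is not an alternative UTF-8 form of ë: valid UTF-8 has exactly one byte sequence per code point. Encode::Repair is one way to deal with that.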
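For readers not working in Perl, the same repair can be sketched in Python. This is a minimal illustration, not how Encode::Repair works internally. The function names `repair_double_utf8` and `to_latin1` are made up for this example, and the sketch assumes the accidental intermediate step was Latin-1 (not Windows-1252) and that a buffer is either entirely double-encoded or not at all.

```python
def repair_double_utf8(data: bytes) -> bytes:
    """Undo one round of UTF-8 double encoding, if the input looks double-encoded.

    Hypothetical helper for illustration. It decodes the bytes as UTF-8,
    maps every character back to a single Latin-1 byte, and keeps the result
    only if those bytes are themselves valid UTF-8.
    """
    try:
        text = data.decode("utf-8")
        once = text.encode("latin-1")  # fails if any character is above U+00FF
        once.decode("utf-8")           # fails if the result is not valid UTF-8
    except (UnicodeDecodeError, UnicodeEncodeError):
        return data                    # not double-encoded (or not UTF-8): leave as-is
    return once


def to_latin1(data: bytes) -> bytes:
    """Transcode UTF-8 input to ISO-8859-1, repairing double encoding first."""
    return repair_double_utf8(data).decode("utf-8").encode("latin-1")


if __name__ == "__main__":
    print(to_latin1(b"\xc3\xab").hex())          # correctly encoded input
    print(to_latin1(b"\xc3\x83\xc2\xab").hex())  # double-encoded input
```

Both calls should yield the single Latin-1 byte 0xEB. Note that the heuristic cannot distinguish double encoding from text that genuinely contains the characters Ã«, so it is best applied only to sources known to be affected.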
