Recovering UTF-8 from broken ISO-Latin-1 sequence

Question

I have recently been encountering several broken UTF-8 strings that got converted to what I believe is ISO-Latin-1, and I was wondering if there is already a tool that could be used to convert them back automatically, since no information is actually destroyed and no bits are actually lost.

Essentially, something like this would take a sequence of characters and display what they would have been if those same bits had been displayed as UTF-8 or some other encoding. Does such a tool exist? (I know it would be easy to create something to do it myself, or even to just do it manually, so I will probably do that if there really isn't anything.)

To clarify: the particular case I am having is that on a particular forum the text editor allows UTF-8 characters, but the forum itself then displays the characters that correspond to the individual bytes of each UTF-8 character.

For characters U+0000 to U+007F it is the exact same character, but:

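  • U+0080 to U+07FF characters are displayed as one character between U+00C0 and U+00DF, followed by one character between U+0080 and U+00BF
  • U+0800 to U+FFFF characters are displayed as one character between U+00E0 and U+00EF, followed by two characters between U+0080 and U+00BF

and so on...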
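The ranges above follow from the UTF-8 byte layout. As a rough Python sketch (the helper name `as_latin1_display` is made up for illustration), this lists the code points a Latin-1 display would show for each byte of one UTF-8 encoded character:

```python
def as_latin1_display(ch):
    """Return the Latin-1 code points shown for the UTF-8 bytes of one character."""
    # Each UTF-8 byte is shown as the Latin-1 character with the same numeric value.
    return [f"U+{b:04X}" for b in ch.encode("utf-8")]

print(as_latin1_display("A"))   # ['U+0041']
print(as_latin1_display("é"))   # ['U+00C3', 'U+00A9']
print(as_latin1_display("世"))  # ['U+00E4', 'U+00B8', 'U+0096']
```

The lead byte lands in U+00C0 to U+00DF for two-byte sequences and U+00E0 to U+00EF for three-byte sequences; every following byte lands in U+0080 to U+00BF.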

So "�" should actually be displayed as the character U+2xy6 (x is the middle 4 bits of '�', y is the last 2 bits of '�' plus '10').

Although I still can't figure out exactly which of the characters between U+0080 and U+00BF '�' is.
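The bit arithmetic for a three-byte sequence can be sketched in Python as below. The sample bytes 0xE2 0x80 0xA6 are an assumption chosen for illustration (they are not taken from the post), and `decode_three_bytes` is a hypothetical helper that skips the overlong and surrogate checks a real decoder performs.

```python
def decode_three_bytes(b1, b2, b3):
    """Combine a UTF-8 three-byte sequence (1110xxxx 10yyyyyy 10zzzzzz) into a code point."""
    if (b1 & 0xF0) != 0xE0 or (b2 & 0xC0) != 0x80 or (b3 & 0xC0) != 0x80:
        raise ValueError("not a three-byte UTF-8 sequence")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

cp = decode_three_bytes(0xE2, 0x80, 0xA6)
print(f"U+{cp:04X}")  # U+2026
```

In ISO-8859-1, the values 0x80 to 0x9F are C1 control codes without a visible glyph, which may be why a continuation byte in that range is hard to identify on screen.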

What I am trying to do is take the ISO-Latin-1 bit values of all the characters in a UTF-8 string, concatenate them, and interpret the resulting bit sequence as if it contained UTF-8 encoded characters.
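In Python this idea comes down to one encode and one decode. A minimal sketch, assuming the displayed characters really are a byte-for-byte Latin-1 view of the original UTF-8 (the function name `recover_utf8` is hypothetical):

```python
def recover_utf8(mojibake):
    """Map each character back to its Latin-1 byte, then decode those bytes as UTF-8."""
    raw = mojibake.encode("latin-1")  # UnicodeEncodeError if a character is above U+00FF
    return raw.decode("utf-8")        # UnicodeDecodeError if the bytes are not valid UTF-8

print(recover_utf8("HÃ©llÃ¶"))  # Héllö
```

If the forum actually renders bytes as Windows-1252 rather than strict ISO-8859-1, `"cp1252"` would be the codec to try instead of `"latin-1"`; which one applies is something to check against the real data.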

Recommended answer

Sorry to say, but this does not make a whole lot of sense. :)

Scenario 1: A string like "Héllö wörld", which contains characters valid in both UTF-8 and Latin1, was properly converted from UTF-8 to Latin1: no problem. You just need to interpret it in Latin1 now.

Scenario 2: A string like "Hello 世界", which contains characters valid in UTF-8 but not in Latin1, was properly converted from UTF-8 to Latin1: in this case, the characters which are not representable in Latin1 have likely been replaced by ?, i.e. the string is now "Hello ??" and there's nothing you can do about it.
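Scenario 2 can be reproduced in Python as a small sketch; `errors="replace"` stands in here for whatever substitution the real converter performed, which is an assumption:

```python
original = "Hello 世界"
# Characters with no Latin-1 equivalent are substituted with "?" during encoding.
latin1_bytes = original.encode("latin-1", errors="replace")
print(latin1_bytes)  # b'Hello ??'
```

Both Chinese characters end up as the same byte 0x3F, so nothing in the result says which characters were there.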

Scenario 3: A string like "Héllö 世界", which contains any sort of characters and was saved as UTF-8, was converted from assumed Latin1 to UTF-8. That means the characters have been misinterpreted but are now properly encoded UTF-8: "HÃ©llÃ¶ ä¸ç". In this case, you can reverse the encoding UTF-8 → Latin1 and interpret the result as UTF-8 to get the original back.
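The full round trip of scenario 3 might look like this in Python. It is a sketch under the assumption that the wrong step was a strict Latin-1 decode, and the variable names are illustrative:

```python
original = "Héllö 世界"
stored = original.encode("utf-8")    # the bytes as originally saved

misread = stored.decode("latin-1")   # wrong assumption: every byte becomes one character
resaved = misread.encode("utf-8")    # valid UTF-8, but of the wrong characters

# Reverse the bad step: UTF-8 -> text -> Latin-1 bytes -> decode as UTF-8.
restored = resaved.decode("utf-8").encode("latin-1").decode("utf-8")
print(restored == original)  # True
```

This works because Latin-1 assigns a character to every byte value from 0x00 to 0xFF, so the mistaken decode loses nothing.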

Scenario 4: A string like "Héllö Wörld", which contains Latin1 characters and was saved as Latin1, was misinterpreted as UTF-8, then saved as UTF-8, in which case it's now "H�ll� W�rld". This string is now irrecoverable.
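A short Python sketch of scenario 4, assuming the misreading decoder substitutes U+FFFD for bytes it cannot interpret (as Python does with `errors="replace"`):

```python
saved = "Héllö Wörld".encode("latin-1")            # é is 0xE9, ö is 0xF6
misread = saved.decode("utf-8", errors="replace")  # invalid sequences become U+FFFD
print(misread)                  # H�ll� W�rld
print(misread.count("\ufffd"))  # 3
```

Different bytes (0xE9 and 0xF6) all collapse into the same replacement character, which is why the original cannot be rebuilt once this result is saved.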

There are many more possible combinations of what happened; it's impossible to tell you exactly what can or can't be done without more information. First of all, make sure you are interpreting the string correctly now and it's not simply a display issue.

The fact that you're seeing a "�" in there suggests that you are trying to interpret something as UTF-8, but the UTF-8 decoder cannot make sense of these characters and replaces them with "�". This is either your fault now and the data is fine, or it's scenario 4.
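One rough way to tell these cases apart in Python is sketched below. The function `diagnose` and its messages are hypothetical, and the check is only a heuristic; it cannot replace looking at the raw bytes.

```python
def diagnose(text):
    """Guess which situation a decoded string is in."""
    if "\ufffd" in text:
        return "contains U+FFFD: decoded with the wrong codec, or bytes already lost"
    try:
        fixed = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "not UTF-8 bytes shown as Latin-1"
    return "unchanged" if fixed == text else "recoverable: " + fixed

print(diagnose("HÃ©llÃ¶"))  # recoverable: Héllö
```

If the first branch fires, it is worth re-reading the original bytes with a different codec before concluding that the data is gone.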
