Fixing a file consisting of both UTF-8 and Windows-1252


Question


I have an application that produces a UTF-8 file, but some of the contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 aka iso-latin-1 or cp1252 aka Windows-1252. Is there a way of recovering the original text?

Answer

Yes!


Obviously, it's better to fix the program creating the file, but that's not always possible. What follows are two solutions.


Encoding::FixLatin provides a function named fix_latin which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.

$ perl -e'
   use Encoding::FixLatin qw( fix_latin );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = fix_latin($bytes);
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A


Heuristics are employed, but they are fairly reliable. Only the following cases will fail:


  • One of
    [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
    encoded using iso-8859-1 or cp1252, followed by one of
    [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
    encoded using iso-8859-1 or cp1252.

  • One of
[àáâãäåæçèéêëìíîï]
encoded using iso-8859-1 or cp1252, followed by two of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
encoded using iso-8859-1 or cp1252.

  • One of
[ðñòóôõö÷]
encoded using iso-8859-1 or cp1252, followed by two of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
encoded using iso-8859-1 or cp1252.
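These failure cases are genuinely undecidable, not a weakness of the module. A small illustration (Python is used here only for brevity; the byte arithmetic is language-independent) of the first case:

```python
# The cp1252 text "Ã©" produces bytes C3 A9, which is also valid
# UTF-8 for the single character "é" -- no heuristic can tell the
# two readings apart from the bytes alone.
raw = "Ã©".encode("cp1252")      # b'\xc3\xa9'
assert raw == b"\xc3\xa9"
decoded = raw.decode("utf-8")    # the very same bytes as UTF-8
assert decoded == "é"
```

Since the bytes are identical under both interpretations, fix_latin must guess, and it guesses UTF-8.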


The same result can be produced using core module Encode, though I imagine this is a fair bit slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.

$ perl -e'
   use Encode qw( decode_utf8 encode_utf8 decode );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = decode_utf8($bytes, sub { encode_utf8(decode("cp1252", chr($_[0]))) });
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A
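For readers outside Perl, the same per-character fallback can be sketched in Python using a custom codec error handler (a rough equivalent for illustration, not fix_latin itself; the handler name is made up here):

```python
import codecs

def _cp1252_fallback(err):
    # Re-interpret the single offending byte as cp1252, then resume
    # decoding immediately after it.
    bad = err.object[err.start:err.start + 1]
    return bad.decode("cp1252"), err.start + 1

codecs.register_error("cp1252_fallback", _cp1252_fallback)

def fix_mixed(raw: bytes) -> str:
    # Decode as UTF-8; any byte that is not valid UTF-8 falls back
    # to its cp1252 interpretation.
    return raw.decode("utf-8", errors="cp1252_fallback")

text = fix_mixed(b"\xD0 \x92 \xD0\x92\n")
print(" ".join(f"U+{ord(c):04X}" for c in text))
# U+00D0 U+0020 U+2019 U+0020 U+0412 U+000A
```

Like the Perl version, this prefers the UTF-8 reading wherever the bytes permit it.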



Each line only uses one encoding

fix_latin works on a character level. If it's known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you could make the process even more reliable by checking whether each line is valid UTF-8.

$ perl -e'
   use Encode qw( decode );
   for $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
      if (!eval {
         $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
         1  # No exception
      }) {
         $text = decode("cp1252", $bytes);
      }

      printf("U+%v04X\n", $text);
   }
'
U+00D0.0020.2019.0020.00D0.2019.000A
U+0412.000A


Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:


  • The line is encoded using iso-8859-1 or cp1252,

  • At least one of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷]
is present in the line,


  • All instances of
[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
are always followed by exactly one of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],


  • All instances of
[àáâãäåæçèéêëìíîï]
are always followed by exactly two of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],


  • All instances of
[ðñòóôõö÷]
are always followed by exactly three of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],


  • None of
[øùúûüýþÿ]
are present in the line, and

  • None of
[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
are present in the line except where previously mentioned.

Notes:

  • Encoding::FixLatin installs a command-line tool, fix_latin, for converting files, and it would be trivial to write one using the second approach.
  • fix_latin (both the function and the file) can be sped up by installing Encoding::FixLatin::XS.
  • The same approach can be used for mixes of UTF-8 with other single-byte encodings. The reliability should be similar, but it can vary.

