Fixing corrupt encoding (with Python)
Question
I have a bunch of text files containing Korean characters with the wrong encoding. Specifically, the characters seem to be encoded in EUC-KR, but the files themselves were saved as UTF8+BOM.
So far I have managed to fix a file with the following steps:
- Open the file with EditPlus (it shows the file's encoding as UTF8+BOM).
- In EditPlus, save the file as ANSI.
- Lastly, in Python:
import codecs

with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
    contents = source_file.read()
with open(html, 'w+b') as dest_file:
    dest_file.write(contents.encode('utf-8'))
I want to automate this, but I have not been able to do so. I can open the original file in Python:
codecs.open(html, 'rb', encoding='utf-8-sig')
However, I haven't been able to figure out how to do the second part (saving as ANSI) programmatically.
Answer
I am presuming here that you have text that was encoded to EUC-KR, then encoded again to UTF-8. If so, encoding to Latin-1 (what Windows calls ANSI) is indeed the best way to get back to the original EUC-KR bytestring.
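To see how this kind of corruption arises, here is a minimal sketch of the presumed failure mode: an editor misreads the EUC-KR bytes as Latin-1 and re-saves them as UTF-8 with a BOM (the word `미술` used below is just an example value):

```python
import codecs

# Correctly encoded EUC-KR bytes for the word '미술'.
original = '미술'.encode('euc-kr')  # b'\xb9\xcc\xbc\xfa'

# An editor misreads those bytes as Latin-1 and re-saves the text as
# UTF-8 with a BOM, producing the double-encoded file from the question.
corrupted = codecs.BOM_UTF8 + original.decode('latin1').encode('utf-8')
print(corrupted)  # b'\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
```

Reversing those two steps in order is exactly what the fix below does.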
Open the file as UTF-8 with BOM, encode to Latin-1, decode as EUC-KR:
import io

with io.open(html, encoding='utf-8-sig') as infh:
    data = infh.read().encode('latin1').decode('euc-kr')
with io.open(html, 'w', encoding='utf8') as outfh:
    outfh.write(data)
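Since the question asks about automating this over a bunch of files, the same round-trip can be wrapped in a function and applied to each file in turn; the `broken/*.html` glob pattern below is only a placeholder for wherever your files actually live:

```python
import glob
import io

def fix_file(path):
    # Read the double-encoded file as UTF-8 (stripping the BOM),
    # recover the original EUC-KR bytes via Latin-1, decode them,
    # and re-save the file as clean UTF-8.
    with io.open(path, encoding='utf-8-sig') as infh:
        data = infh.read().encode('latin1').decode('euc-kr')
    with io.open(path, 'w', encoding='utf8') as outfh:
        outfh.write(data)

# 'broken/*.html' is a placeholder pattern -- adjust to your files.
for html in glob.glob('broken/*.html'):
    fix_file(html)
```

Note that this rewrites each file in place, so run it on copies first if the originals matter.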
I'm using the io.open() function here instead of codecs as the more robust method; io is the new Python 3 library, also backported to Python 2.
Demo:
>>> broken = '\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
미술
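The demo above is Python 2, where a plain string literal is already a bytestring; under Python 3 the same round-trip starts from an explicit `bytes` literal:

```python
# The same corrupt bytes as in the demo, as an explicit bytes literal.
broken = b'\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'

# Strip the BOM and undo the mistaken UTF-8 layer, then recover the
# original EUC-KR bytes via Latin-1 and decode them properly.
fixed = broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
print(fixed)  # 미술
```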