Fixing corrupt encoding (with Python)
Question
I have a bunch of text files containing Korean characters with the wrong encoding. Specifically, the characters seem to be encoded in EUC-KR, but the files themselves were saved as UTF8+BOM.
So far I have managed to fix a file with the following steps:
- Open the file with EditPlus (it shows the file's encoding as UTF8+BOM).
- In EditPlus, save the file as ANSI.
- Lastly, in Python:
import codecs

with codecs.open(html, 'rb', encoding='euc-kr') as source_file:
    contents = source_file.read()
with open(html, 'w+b') as dest_file:
    dest_file.write(contents.encode('utf-8'))
I want to automate this, but I have not been able to do so. I can open the original file in Python:
codecs.open(html, 'rb', encoding='utf-8-sig')
However, I haven't been able to figure out how to do the second part (saving as ANSI) programmatically.
Answer
I am presuming here that you have text that was encoded to EUC-KR, then encoded again to UTF-8. If so, encoding to Latin-1 (what Windows calls ANSI) is indeed the best way to get back to the original EUC-KR bytestring.
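To see how this kind of corruption arises, here is a minimal sketch of the presumed failure mode: an editor misreads the EUC-KR bytes as Latin-1 and re-saves them as UTF-8 with a BOM (the word `미술` used below is just an example value):

```python
import codecs

# Correctly encoded EUC-KR bytes for the word '미술'.
original = '미술'.encode('euc-kr')  # b'\xb9\xcc\xbc\xfa'

# An editor misreads those bytes as Latin-1 and re-saves the text as
# UTF-8 with a BOM, producing the double-encoded file from the question.
corrupted = codecs.BOM_UTF8 + original.decode('latin1').encode('utf-8')
print(corrupted)  # b'\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
```

Reversing those two steps in order is exactly what the fix below does.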
Open the file as UTF-8 with BOM, encode to Latin-1, decode as EUC-KR:
import io

with io.open(html, encoding='utf-8-sig') as infh:
    data = infh.read().encode('latin1').decode('euc-kr')
with io.open(html, 'w', encoding='utf8') as outfh:
    outfh.write(data)
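Since the question asks about automating this over a bunch of files, the same round-trip can be wrapped in a function and applied to each file in turn; the `broken/*.html` glob pattern below is only a placeholder for wherever your files actually live:

```python
import glob
import io

def fix_file(path):
    # Read the double-encoded file as UTF-8 (stripping the BOM),
    # recover the original EUC-KR bytes via Latin-1, decode them,
    # and re-save the file as clean UTF-8.
    with io.open(path, encoding='utf-8-sig') as infh:
        data = infh.read().encode('latin1').decode('euc-kr')
    with io.open(path, 'w', encoding='utf8') as outfh:
        outfh.write(data)

# 'broken/*.html' is a placeholder pattern -- adjust to your files.
for html in glob.glob('broken/*.html'):
    fix_file(html)
```

Note that this rewrites each file in place, so run it on copies first if the originals matter.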
I'm using the io.open() function here instead of codecs as the more robust method; io is the new Python 3 library, also backported to Python 2.
Demo:
>>> broken = '\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'
>>> print broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
미술
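The demo above is Python 2, where a plain string literal is already a bytestring; under Python 3 the same round-trip starts from an explicit `bytes` literal:

```python
# The same corrupt bytes as in the demo, as an explicit bytes literal.
broken = b'\xef\xbb\xbf\xc2\xb9\xc3\x8c\xc2\xbc\xc3\xba'

# Strip the BOM and undo the mistaken UTF-8 layer, then recover the
# original EUC-KR bytes via Latin-1 and decode them properly.
fixed = broken.decode('utf-8-sig').encode('latin1').decode('euc-kr')
print(fixed)  # 미술
```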