Convert different encodings to ASCII


Question

I have a hundred files, and according to chardet each file is encoded with one of the following:

['UTF-8', 'ascii', 'ISO-8859-2', 'UTF-16LE', 'TIS-620', 'utf-8', 'SHIFT_JIS', 'ISO-8859-7']

So I know each file's encoding, and therefore which encoding to open it with.

I want to convert all files to ASCII only. I also want to convert typographic variants of characters, such as dashes and apostrophes, to their plain ASCII equivalents. For example, b"\xe2\x80\x94".decode("utf8") (an em dash) should be converted to -. The most important thing is that the text stays easy to read: I don't want "don t", for example, but "don't".

How do I do this?

I can use either Python 2 or 3 to solve this.

This is as far as I got with Python 2. To begin with, I'm trying to detect the lines that contain non-ASCII characters.

for file_name in os.listdir('.'):
        print(file_name)
        r = chardet.detect(open(file_name).read())
        charenc = r['encoding']
        with open(file_name,"r" ) as f:
            for line in f.readlines():
              if line.decode(charenc) != line.decode("ascii","ignore"):
                print(line.decode("ascii","ignore"))

This gives me the following exception:

    if line.decode(charenc) != line.decode("ascii","ignore"):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 6: truncated data


Recommended answer

Don't use .readlines() on a binary file whose line endings are multi-byte. In little-endian UTF-16, a newline is encoded as two bytes: 0A (a newline in ASCII) followed by 00 (a NUL byte). .readlines() splits right after the first of those two bytes, leaving you with incomplete data to decode.
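The split can be reproduced without any files. The following is a small sketch, assuming a made-up two-line sample string; it reads UTF-16LE bytes line by line in binary mode and shows that the first "line" ends in a lone 0A byte and cannot be decoded, while decoding the whole buffer first works:

```python
import io

# Hypothetical sample text, encoded the way a UTF-16LE file would store it.
text = u"ab\ncd\n"
raw = text.encode("utf-16-le")

# Binary-mode line splitting cuts after the 0x0A byte, i.e. in the
# middle of the two-byte newline (0A 00).
lines = io.BytesIO(raw).readlines()
first = lines[0]

try:
    first.decode("utf-16-le")
    first_ok = True
except UnicodeDecodeError:
    # Odd number of bytes: the trailing 0x0A is "truncated data".
    first_ok = False

# Decoding everything first, then splitting, keeps the newlines intact.
decoded_whole = raw.decode("utf-16-le").splitlines()
```

Every later chunk begins with the orphaned 00 byte, so it is misaligned as well, even in cases where it happens to decode without an error.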

Reopen the file with the io library for ease of decoding:

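import io
import os

import chardet

for file_name in os.listdir('.'):
    print(file_name)
    # Read the raw bytes so chardet can guess the encoding
    with open(file_name, "rb") as raw:
        charenc = chardet.detect(raw.read())['encoding']
    # io.open() decodes while reading, so lines are split on real newlines
    with io.open(file_name, "r", encoding=charenc) as f:
        for line in f:
            # Drop every character that has no ASCII representation
            line = line.encode("ascii", "ignore").decode("ascii")
            print(line)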
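The loop above only prints the result. To write an ASCII-only copy of a file instead, one possible sketch follows. The function name convert_file and its parameters are made up for illustration, and the encoding is assumed to have been detected already (for example with chardet):

```python
import io


def convert_file(src_path, dst_path, encoding):
    """Decode src_path with `encoding` and write an ASCII-only copy.

    Hypothetical helper: characters without an ASCII representation
    are silently dropped (errors="ignore").
    """
    # newline="" keeps the original line endings untouched on both sides.
    with io.open(src_path, "r", encoding=encoding, newline="") as src:
        with io.open(dst_path, "w", encoding="ascii",
                     errors="ignore", newline="") as dst:
            for line in src:
                dst.write(line)
```

Writing to a separate destination path leaves the original file in place, which makes it easy to compare the two before deleting anything.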

To replace specific Unicode codepoints with ASCII-friendly characters, use a dictionary that maps each codepoint to a codepoint or a unicode string, and call line.translate() before encoding to ASCII:

charmap = {
    0x2014: u'-',   # em dash
    0x201D: u'"',   # right double quotation mark
    # etc.
}

line = line.translate(charmap)

I used hexadecimal integer literals here to define the Unicode codepoints to map from. Each value in the dictionary must be a unicode string, an integer (a codepoint), or None to delete that codepoint altogether.
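Putting the two steps together, a minimal sketch could look like this. The helper name to_ascii and the extra map entries beyond the two shown above are illustrative assumptions, so extend the map with whatever characters actually occur in your files:

```python
# Map typographic characters to plain ASCII look-alikes.
# Only the first and last entries come from the answer above;
# the rest are illustrative additions.
CHARMAP = {
    0x2014: u'-',    # em dash
    0x2013: u'-',    # en dash
    0x2018: u"'",    # left single quotation mark
    0x2019: u"'",    # right single quotation mark (typographic apostrophe)
    0x201C: u'"',    # left double quotation mark
    0x201D: u'"',    # right double quotation mark
}


def to_ascii(text):
    """Hypothetical helper: translate first, then drop what is left."""
    # translate() must run before encoding; afterwards the characters
    # it would have replaced are already gone.
    text = text.translate(CHARMAP)
    return text.encode("ascii", "ignore").decode("ascii")


sample = u"don\u2019t \u2014 \u201cok\u201d"
converted = to_ascii(sample)
```

With this map, "don't" keeps its apostrophe instead of turning into "don t" or "dont". Characters that are not in the map, such as accented letters, are still dropped by the "ignore" error handler.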
