Convert different encodings to ASCII

Views: 302 Published: 2017/8/17 1:38:44 Tags: python encoding character-encoding


Problem description

I have a hundred files, and according to chardet each file is encoded with one of the following:

['UTF-8', 'ascii', 'ISO-8859-2', 'UTF-16LE', 'TIS-620', 'utf-8', 'SHIFT_JIS', 'ISO-8859-7']

So I know each file's encoding, and therefore which encoding to open it with.

I wish to convert all files to ASCII only. I also wish to convert different versions of characters like - and ' (dashes and typographic apostrophes, for example) to their plain ASCII equivalents. For example, b"\xe2\x80\x94".decode("utf8") should be converted to -. The most important thing is that the text stays easy to read: I don't want "don t", for example, but rather "don't".

How do I do that?

I can use either Python 2 or 3 to solve this.

This is as far as I got with Python 2. I'm trying to detect the lines that contain non-ASCII characters to begin with.

import os

import chardet

for file_name in os.listdir('.'):
    print(file_name)
    r = chardet.detect(open(file_name).read())
    charenc = r['encoding']
    with open(file_name, "r") as f:
        for line in f.readlines():
            if line.decode(charenc) != line.decode("ascii", "ignore"):
                print(line.decode("ascii", "ignore"))

This gives me the following exception:

    if line.decode(charenc) != line.decode("ascii","ignore"):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 6: truncated data

Recommended answer

Don't use .readlines() on a file read as raw bytes when the text uses a multi-byte encoding. In little-endian UTF-16, a newline is encoded as two bytes, 0A (a newline in ASCII) and 00 (a NUL). .readlines() splits after the first of those two bytes, leaving you with incomplete data to decode.
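The split can be reproduced in isolation. The following is a minimal sketch, not taken from the question's files: the sample text is made up, io.BytesIO stands in for a file opened in binary mode, and try_decode is a hypothetical helper that returns None when a chunk cannot be decoded.

```python
import io

# Made-up sample text, encoded the same way as the problematic files.
text = u"ab\ncd\n"
raw = text.encode("utf-16-le")

# Splitting the raw bytes on the 0A byte cuts each 2-byte newline in half.
chunks = io.BytesIO(raw).readlines()


def try_decode(chunk):
    """Return the decoded chunk, or None if it is not valid UTF-16LE."""
    try:
        return chunk.decode("utf-16-le")
    except UnicodeDecodeError:
        return None


# The first chunk has an odd number of bytes and fails to decode; the
# second starts with a stray 00 byte, so it decodes to the wrong characters.
decoded_chunks = [try_decode(c) for c in chunks]

# Decoding everything first and splitting afterwards gives the real lines.
decoded_whole = raw.decode("utf-16-le").splitlines()
```

Here the first chunk cannot be decoded at all, and the second one decodes without error but to the wrong text, which is arguably the more dangerous outcome.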

Reopen the file with the io library for ease of decoding:

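import io
import os

import chardet

for file_name in os.listdir('.'):
    print(file_name)
    # Read the raw bytes so chardet can guess the encoding
    with open(file_name, "rb") as raw:
        r = chardet.detect(raw.read())
    charenc = r['encoding']
    # io.open decodes while reading, so lines are split on decoded newlines
    with io.open(file_name, "r", encoding=charenc) as f:
        for line in f:
            # Drop every character that has no ASCII representation
            line = line.encode("ascii", "ignore").decode("ascii")
            print(line)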
To replace specific Unicode code points with ASCII-friendly characters, use a dictionary that maps each code point to another code point or to a Unicode string, and call line.translate() first, before encoding to ASCII:

charmap = {
    0x2014: u'-',   # em dash
    0x201D: u'"',   # right double quotation mark
    # etc.
}

line = line.translate(charmap)

I used hexadecimal integer literals to define the Unicode code points to map from. Each value in the dictionary must be a Unicode string, an integer (a code point), or None to delete that code point altogether.
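Putting the mapping and the ASCII encoding together might look like the sketch below. The extra table entries, the to_ascii helper, and the sample string are my own assumptions chosen to show all three kinds of value (string, integer, None); extend the table to whatever characters actually occur in your files.

```python
# Keys are code points; values are a string, a code point, or None.
charmap = {
    0x2014: u"-",   # em dash -> hyphen-minus (string value)
    0x2019: u"'",   # right single quotation mark -> apostrophe
    0x201C: 0x22,   # left double quotation mark -> '"' (integer value)
    0x201D: u'"',   # right double quotation mark
    0x00AD: None,   # soft hyphen: deleted altogether
}


def to_ascii(text, table=charmap):
    """Map known characters to ASCII look-alikes, then drop the rest."""
    return text.translate(table).encode("ascii", "ignore").decode("ascii")


sample = u"\u201cdon\u2019t\u201d \u2014 co\u00adoperate"
converted = to_ascii(sample)
print(converted)  # "don't" - cooperate
```

Translating before encoding matters: once "ignore" has dropped a character, there is nothing left to map.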
