Convert different encodings to ASCII
Problem description
I have a hundred files and according to chardet each file is encoded with one of the following:
['UTF-8', 'ascii', 'ISO-8859-2', 'UTF-16LE', 'TIS-620', 'utf-8', 'SHIFT_JIS', 'ISO-8859-7']
So I know each file's encoding, and therefore which encoding to open it with.
I wish to convert all files to ASCII only. I also wish to convert different versions of characters like - and ' to their plain ASCII equivalents. For example, b"\xe2\x80\x94".decode("utf8") should be converted to -. The most important thing is that the text is easy to read. I don't want don t, for example, but rather don't.
How would I do this?
I can use either Python 2 or 3 to solve this.
This is as far as I got in Python 2. I'm trying to detect the lines that contain non-ASCII characters to begin with.
import os
import chardet

for file_name in os.listdir('.'):
    print(file_name)
    r = chardet.detect(open(file_name).read())
    charenc = r['encoding']
    with open(file_name,"r" ) as f:
        for line in f.readlines():
            if line.decode(charenc) != line.decode("ascii","ignore"):
                print(line.decode("ascii","ignore"))
This gives me the following exception:
if line.decode(charenc) != line.decode("ascii","ignore"):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 6: truncated data
Recommended answer
Don't use .readlines() on a binary file whose text is in a multi-byte encoding. In little-endian UTF-16, a newline is encoded as two bytes: 0A (a newline in ASCII) and 00 (a NULL). .readlines() splits after the first of those two bytes, leaving you with incomplete data to decode.
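The split can be reproduced without any files. The following is a minimal sketch, assuming a made-up two-line sample string; io.BytesIO stands in for a file opened in binary mode, and try_decode is a hypothetical helper added only for illustration.

```python
import io

text = u"ab\ncd\n"
data = text.encode("utf-16-le")  # each character becomes two bytes

# Bytes-mode readlines() cuts after every 0x0A byte, i.e. mid-character.
byte_lines = io.BytesIO(data).readlines()


def try_decode(chunk):
    """Return the decoded text, or None if the chunk is not valid UTF-16LE."""
    try:
        return chunk.decode("utf-16-le")
    except UnicodeDecodeError:
        return None


# The first chunk has an odd number of bytes, so decoding it fails;
# later chunks start one byte off and decode to the wrong characters.
broken = [try_decode(chunk) for chunk in byte_lines]

# Decoding first and splitting afterwards keeps every character whole.
decoded_lines = data.decode("utf-16-le").splitlines()
```

Here the first chunk ends with a lone 0A byte, which is the "truncated data" case from the traceback in the question.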
Reopen the file with the io library for ease of decoding:
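import io
import os

import chardet

for file_name in os.listdir('.'):
    print(file_name)
    # Read the raw bytes so chardet can guess the encoding.
    with open(file_name, "rb") as raw:
        charenc = chardet.detect(raw.read())['encoding']
    # io.open decodes while reading, so lines are split on real newlines.
    with io.open(file_name, "r", encoding=charenc) as f:
        for line in f:
            # Drop every character that has no ASCII representation.
            line = line.encode("ascii", "ignore").decode("ascii")
            print(line)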
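If the goal is to write ASCII copies of the files rather than print them, the same idea can be wrapped in a small function. This is only a sketch: convert_file and its parameter names are hypothetical, and the source encoding is assumed to be already known (for example from chardet, as above).

```python
import io


def convert_file(src_path, dst_path, src_encoding):
    """Decode src_path with src_encoding and write an ASCII-only copy."""
    with io.open(src_path, "r", encoding=src_encoding) as src:
        # errors="ignore" silently drops characters with no ASCII form.
        with io.open(dst_path, "w", encoding="ascii", errors="ignore") as dst:
            for line in src:
                dst.write(line)
```

Writing to a separate destination path keeps the original file intact, which seems safer when the detected encoding might be wrong.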
To replace specific Unicode codepoints with ASCII-friendly characters, use a dictionary that maps each codepoint to a codepoint or a Unicode string, and call line.translate() before encoding:
charmap = {
    0x2014: u'-',   # em dash
    0x201D: u'"',   # right double quotation mark
    # etc.
}
line = line.translate(charmap)
I used hexadecimal integer literals here to define the Unicode codepoints to map from. Each value in the dictionary must be a Unicode string, an integer (a codepoint), or None to delete that codepoint altogether.
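Putting the two steps together, a minimal sketch might look like the following. The entries in CHARMAP beyond the em dash and right double quote, the to_ascii helper, and the sample string are all assumptions added for illustration; extend the mapping to whatever characters actually occur in your files.

```python
# Hypothetical mapping from Unicode codepoints to ASCII look-alikes.
CHARMAP = {
    0x2013: u"-",    # en dash
    0x2014: u"-",    # em dash
    0x2018: u"'",    # left single quotation mark
    0x2019: u"'",    # right single quotation mark
    0x201C: u'"',    # left double quotation mark
    0x201D: u'"',    # right double quotation mark
    0x2026: u"...",  # horizontal ellipsis
    0x00A0: u" ",    # no-break space
}


def to_ascii(text):
    """Map known codepoints to ASCII, then drop whatever is left."""
    replaced = text.translate(CHARMAP)
    return replaced.encode("ascii", "ignore").decode("ascii")


sample = u"don\u2019t \u2014 \u201cquoted\u201d caf\u00e9"
result = to_ascii(sample)
```

Anything not listed in the mapping, such as the accented letter in the sample, is simply removed by the "ignore" error handler, so characters you care about need their own entries.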