Python read from file and remove non-ascii characters
Question
I have the following program that reads a file word by word and writes each word to another file, but without the non-ascii characters from the first file.
import unicodedata
import codecs

infile = codecs.open('d.txt','r',encoding='utf-8',errors='ignore')
outfile = codecs.open('d_parsed.txt','w',encoding='utf-8',errors='ignore')

for line in infile.readlines():
    for word in line.split():
        outfile.write(word+" ")
    outfile.write("\n")

infile.close()
outfile.close()
The only problem I am facing is that this code does not write a new line to the second file (d_parsed.txt). Any clues?
Recommended answer
codecs.open() doesn't support universal newlines, e.g., it doesn't translate \r\n to \n while reading on Windows.
Use io.open() instead:
#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('d.txt','r',encoding='utf-8',errors='ignore') as infile, \
     io.open('d_parsed.txt','w',encoding='ascii',errors='ignore') as outfile:
    for line in infile:
        print(*line.split(), file=outfile)
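To see the newline difference described above, here is a small sketch that reads the same bytes back through both APIs. The helper name read_both_ways and the temporary file are only there for illustration; they are not part of the original answer.

```python
import codecs
import io
import os
import tempfile


def read_both_ways(raw_bytes):
    """Write raw_bytes to a temp file, then read it back with both APIs."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(raw_bytes)
        # codecs.open() opens the underlying file in binary mode,
        # so '\r\n' passes through decoding untouched.
        with codecs.open(path, 'r', encoding='utf-8') as f:
            via_codecs = f.read()
        # io.open() in text mode (default newline=None) translates
        # '\r\n' to '\n' while reading.
        with io.open(path, 'r', encoding='utf-8') as f:
            via_io = f.read()
    finally:
        os.remove(path)
    return via_codecs, via_io


via_codecs, via_io = read_both_ways(b'first\r\nsecond\r\n')
print(repr(via_codecs))  # still contains '\r\n'
print(repr(via_io))      # contains only '\n'
```

The text read through codecs.open() keeps its \r\n line endings, while the text read through io.open() has them translated to \n.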
btw, if you want to remove non-ascii characters, you should use ascii instead of utf-8 as the output encoding.
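As a rough in-memory illustration of why the ascii codec matters here: encoding with errors set to 'ignore' silently drops every character that ASCII cannot represent. The function name strip_non_ascii is made up for this sketch.

```python
def strip_non_ascii(text):
    """Drop every character that has no ASCII encoding."""
    # 'ignore' tells the codec to skip unencodable characters
    # instead of raising UnicodeEncodeError.
    return text.encode('ascii', 'ignore').decode('ascii')


cleaned = strip_non_ascii(u'caf\u00e9 na\u00efve')
print(cleaned)
```

With utf-8 as the target encoding the same call would keep the accented characters, because utf-8 can encode them.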
If the input encoding is compatible with ascii (such as utf-8) then you could open the file in binary mode and use bytes.translate() to remove non-ascii characters:
#!/usr/bin/env python
nonascii = bytearray(range(0x80, 0x100))

with open('d.txt','rb') as infile, open('d_parsed.txt','wb') as outfile:
    for line in infile: # b'\n'-separated lines (Linux, OSX, Windows)
        outfile.write(line.translate(None, nonascii))
Unlike the first code example, it doesn't normalize whitespace.
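The whitespace difference can be checked on an in-memory line. This is a minimal sketch, assuming utf-8 input; the sample string and the extra split/join step are illustrative additions, not part of the answer.

```python
# Bytes 0x80-0xFF never occur in ASCII; in UTF-8 they appear only inside
# multi-byte sequences, so deleting them removes whole non-ASCII characters.
nonascii = bytearray(range(0x80, 0x100))

line = u'caf\u00e9   na\u00efve\n'.encode('utf-8')

# translate() only deletes the listed bytes; spaces and the newline stay.
stripped = line.translate(None, nonascii)

# Splitting and rejoining collapses whitespace runs, which is what
# print(*line.split()) does in the first example.
normalized = b' '.join(stripped.split())

print(stripped)
print(normalized)
```

If normalized whitespace is wanted with the binary approach, a split/join step like the one above could be applied per line before writing, with a newline appended afterwards.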