Python CSV file UTF-16 to UTF-8 print error
Question
There are a number of topics on this problem around the web, but I cannot seem to find the answer for my specific case.
I have a CSV file. I am not sure what was done to it, but when I try to open it, I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Here is the full traceback:
Traceback (most recent call last):
  File "keywords.py", line 31, in <module>
    main()
  File "keywords.py", line 28, in main
    get_csv(file_full_path)
  File "keywords.py", line 19, in get_csv
    for row in reader:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
With the help of Stack Overflow, I got it open with:
reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
Now the problem is that when I am reading the file:
def get_csv(file_full_path):
    import csv, codecs
    reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
    for row in reader:
        print row
I get an error on this symbol:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)
I have tried decode, encode, and unicode() on the string containing that character, but it does not seem to help.
for row in reader:
    #decoded_row = [element_s.decode('UTF-8') for element_s in row]
    #print decoded_row
    encoded_row = [element_s.encode('UTF-8') for element_s in row]
    print encoded_row
At this point I do not really understand why. If I run
>>> print u'\u5a07'
娇
or
>>> print '娇'
娇
it works. It also works in the terminal. I have checked the default encoding in the terminal and in the Python shell; it is UTF-8 everywhere, and the symbol prints without trouble. I assume it has something to do with me opening the file with codecs using UTF-16.
I am not sure where to go from here. Could anyone help out?
Recommended answer
The Python 2 csv module cannot handle Unicode input. Its documentation page says so explicitly:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe.
You need to convert your CSV file to UTF-8 so that the module can deal with it:
import codecs

# Decode each line from UTF-16 and write it back out as UTF-8 bytes.
with codecs.open(file_full_path, 'rU', 'UTF-16') as infile:
    with open(file_full_path + '.utf8', 'wb') as outfile:
        for line in infile:
            outfile.write(line.encode('utf8'))
Alternatively, you can use the command-line utility iconv to convert the file for you.
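Both routes assume the file really is UTF-16. The byte 0xff at position 0 in the original error is consistent with a UTF-16 little-endian byte order mark (FF FE), and a look at the first two bytes can confirm that before converting. The following is a minimal sketch, not part of the original answer; the helper name sniff_utf16_bom is hypothetical.

```python
import codecs


def sniff_utf16_bom(path):
    """Return 'utf-16-le', 'utf-16-be', or None based on the first two bytes.

    Hypothetical helper for illustration. Note that a UTF-32 LE file also
    starts with FF FE, so this is a hint rather than proof.
    """
    with open(path, 'rb') as f:
        head = f.read(2)
    if head == codecs.BOM_UTF16_LE:  # bytes FF FE
        return 'utf-16-le'
    if head == codecs.BOM_UTF16_BE:  # bytes FE FF
        return 'utf-16-be'
    return None
```

If this returns None, the file may have no byte order mark or may use a different encoding, and the conversion step should be revisited.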
Then use that re-coded file to read your data:
import csv

reader = csv.reader(open(file_full_path + '.utf8', 'rb'), delimiter='\t', quotechar='"')
for row in reader:
    # Each cell is a UTF-8 byte string; decode it to unicode before use.
    print [c.decode('utf8') for c in row]
Note that the columns then need decoding to unicode manually.
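The limitation above applies to the Python 2 csv module. If moving to Python 3 is an option (an assumption, since the question uses Python 2), the csv module there works on text, so the file can be opened with its encoding and read directly without an intermediate UTF-8 copy. A minimal sketch under that assumption; the function name read_utf16_tsv is hypothetical:

```python
import csv


def read_utf16_tsv(path):
    """Read a tab-separated UTF-16 file and return its rows as lists of str.

    Python 3 only. Hypothetical helper for illustration.
    """
    # The 'utf-16' codec consumes the byte order mark while decoding.
    # newline='' leaves line-ending handling to the csv module.
    with open(path, 'r', encoding='utf-16', newline='') as f:
        return list(csv.reader(f, delimiter='\t', quotechar='"'))
```

Each cell comes back as a str, so no manual decoding step is needed. For very large files, iterating over the reader inside the with block would avoid holding every row in memory.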