Python CSV file UTF-16 to UTF-8 print error


Problem description

There are a number of topics on this problem around the web, but I cannot seem to find an answer for my specific case.

I have a CSV file. I am not sure what was done to it, but when I try to open it, I get:


UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Here is the full traceback:

Traceback (most recent call last):
  File "keywords.py", line 31, in <module>
    main()
  File "keywords.py", line 28, in main
    get_csv(file_full_path)
  File "keywords.py", line 19, in get_csv
    for row in reader:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)

With the help of Stack Overflow, I got it open with:

reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')

Now the problem is that when I read the file:

def get_csv(file_full_path):
    import csv, codecs
    reader = csv.reader(codecs.open(file_full_path, 'rU', 'UTF-16'), delimiter='\t', quotechar='"')
    for row in reader:
        print row

it fails on one character:


UnicodeEncodeError: 'ascii' codec can't encode character u'\u5a07' in position 10: ordinal not in range(128)

I have tried decode, encode, and unicode() on the string containing that character, but none of them seems to help.

for row in reader:
    #decoded_row = [element_s.decode('UTF-8') for element_s in row]
    #print decoded_row
    encoded_row = [element_s.encode('UTF-8') for element_s in row]
    print encoded_row

At this point I do not really understand why. If I run

>>> print u'\u5a07'
娇

>>> print '娇'
娇

it works. It also works in the terminal. I have checked the default encoding in the terminal and in the Python shell, and it is UTF-8 everywhere, so the symbol prints without trouble. I assume the problem has something to do with opening the file with codecs using UTF-16.

I am not sure where to go from here. Could anyone help out?

Recommended answer

The csv module in Python 2 cannot handle Unicode input. Its documentation page says so explicitly:


Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe;
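
The byte 0xff at position 0 in the first error message is consistent with a UTF-16 little-endian byte order mark (FF FE), which is never a valid first byte in UTF-8. As a rough sketch, the leading bytes of a file can be checked against the BOM constants in the standard codecs module before choosing a codec. The helper name guess_encoding_from_bom is hypothetical and not part of the original answer, and a file without a BOM simply yields None.

```python
import codecs

# BOMs to check. UTF-32 comes first so that its little-endian mark
# (FF FE 00 00) is not mistaken for the UTF-16 one (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(head):
    """Return a codec name suggested by the leading bytes, or None.

    Hypothetical helper for illustration; `head` is the first few
    bytes of the file, read in binary mode.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

To use it, read the first four bytes with open(path, 'rb').read(4) and pass them in. A result of "utf-16" suggests the conversion step described in this answer is the right one.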

You need to convert your CSV file to UTF-8 so that the module can deal with it:

import codecs

# Decode the UTF-16 input and re-encode each line as UTF-8 bytes.
with codecs.open(file_full_path, 'rU', 'UTF-16') as infile:
    with open(file_full_path + '.utf8', 'wb') as outfile:
        for line in infile:
            outfile.write(line.encode('utf8'))

Alternatively, you can use the command-line utility iconv to convert the file for you.

Then use the re-encoded file to read your data:

import csv

reader = csv.reader(open(file_full_path + '.utf8', 'rb'), delimiter='\t', quotechar='"')
for row in reader:
    # Each cell is a UTF-8 byte string; decode it to unicode before use.
    print [c.decode('utf8') for c in row]

Note that the columns then need to be decoded to unicode manually.
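
The code in this thread is Python 2. If moving to Python 3 is an option (an assumption, since the question does not say so), the intermediate UTF-8 file is probably unnecessary, because the Python 3 csv module works on text rather than bytes. A minimal sketch, with read_utf16_tsv as a hypothetical helper name and the same delimiter and quote character as above:

```python
import csv
import io


def read_utf16_tsv(path):
    """Read a tab-separated UTF-16 file and return its rows as lists of text.

    Python 3 only; hypothetical helper, not part of the original answer.
    """
    # newline="" leaves line-ending handling to the csv module.
    with io.open(path, "r", encoding="utf-16", newline="") as infile:
        reader = csv.reader(infile, delimiter="\t", quotechar='"')
        return [row for row in reader]
```

Each cell comes back as a text string, so no per-column decode step is needed. Whether a given row then prints cleanly still depends on the output encoding of the terminal.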
