使用Python在Pandas中读取CSV文件时出现UnicodeDecodeError [英] UnicodeDecodeError when reading CSV file in Pandas with Python

查看:480
本文介绍了使用Python在Pandas中读取CSV文件时出现UnicodeDecodeError的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在运行一个程序,正在处理30,000个类似文件.他们中有随机的数目正在停止并产生此错误...

I'm running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error...

   File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
     data = pd.read_csv(filepath, names=fields)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
     return _read(filepath_or_buffer, kwds)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
     return parser.read()
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
     ret = self._engine.read(nrows)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
     data = self._reader.read(nrows)
   File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
   File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
   File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
   File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
   File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
   File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
   File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
   File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

这些文件的来源/创建都来自同一位置.纠正此错误以继续进行导入的最佳方法是什么?

The source/creation of these files all come from the same place. What's the best way to correct this to proceed with the import?

推荐答案

read_csv采用encoding选项来处理不同格式的文件.我主要使用read_csv('file', encoding = "ISO-8859-1")encoding = "utf-8"进行阅读,通常使用utf-8进行to_csv.

read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding = "ISO-8859-1"), or alternatively encoding = "utf-8" for reading, and generally utf-8 for to_csv.

您也可以使用'latin'之类的几个alias选项之一,而不是'ISO-8859-1'(请参见 Python文档,也适用于您可能会遇到的许多其他编码.

You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter).

请参见相关的熊猫文档 csv文件上的python文档示例,以及有关SO的许多相关问题.良好的背景资源是

See relevant Pandas documentation, python docs examples on csv files, and plenty of related questions here on SO. A good background resource is What every developer should know about unicode and character sets.

要检测编码(假设文件包含非ASCII字符),可以使用enca(请参阅

To detect the encoding (assuming the file contains non-ascii characters), you can use enca (see man page) or file -i (linux) or file -I (osx) (see man page).

这篇关于使用Python在Pandas中读取CSV文件时出现UnicodeDecodeError的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆