UnicodeDecodeError when reading CSV file in Pandas with Python
Question
I'm running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error...
  File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
    data = pd.read_csv(filepath, names=fields)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
    return parser.read()
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
  File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
  File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte
The source/creation of these files all come from the same place. What's the best way to correct this to proceed with the import?
Recommended answer
read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf-8" for reading, and generally utf-8 for to_csv.
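
As a minimal sketch of how the encoding option could be applied to a batch where only some files fail, the helper below tries UTF-8 first and falls back to ISO-8859-1. The function name read_csv_with_fallback and the order of encodings are assumptions for illustration, not something from the question or from pandas itself.

```python
import pandas as pd


def read_csv_with_fallback(filepath, names, encodings=("utf-8", "ISO-8859-1")):
    """Hypothetical helper: try each encoding in order.

    Returns a (DataFrame, encoding_used) tuple so the caller can log
    which files needed the fallback.
    """
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            frame = pd.read_csv(filepath, names=names, encoding=encoding)
            return frame, encoding
        except UnicodeDecodeError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc
    raise last_error


# Example call, mirroring the line from the traceback (names are the asker's):
# data, used = read_csv_with_fallback(filepath, names=fields)
```

Note that ISO-8859-1 assigns a character to every possible byte, so the fallback will not raise, but the text will be wrong if the file was actually written in some other encoding.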
You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter).
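
To check that two spellings really name the same codec, one option is to ask the standard library's codecs module. This is a small sketch; the helper name canonical_codec_name is made up for illustration.

```python
import codecs


def canonical_codec_name(alias):
    """Return the name Python's codec registry reports for an alias."""
    return codecs.lookup(alias).name


# Spellings that are expected to resolve to the same Latin-1 codec.
for alias in ("latin", "latin1", "L1", "ISO-8859-1"):
    print(alias, "->", canonical_codec_name(alias))

# The byte from the error message is a valid character in Latin-1.
print(b"\xda".decode("latin"))
```

codecs.lookup raises LookupError for a name it does not know, which makes it a quick way to validate an encoding string before passing it to read_csv.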
See the relevant Pandas documentation, the Python docs examples on CSV files, and plenty of related questions here on SO. A good background resource is "What every developer should know about unicode and character sets".
To detect the encoding (assuming the file contains non-ascii characters), you can use enca (see man page) or file -i (linux) or file -I (osx) (see man page).
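
enca and file are command-line tools. As a rough Python-side complement (not an encoding detector, and not what those tools do internally), the sketch below only reports whether a file decodes as a given encoding and, if not, where the first bad byte sits. That can help pick out which of many similar files need a different encoding. The function name find_decode_failure is hypothetical.

```python
def find_decode_failure(filepath, encoding="utf-8"):
    """Return None if the file decodes cleanly with `encoding`.

    Otherwise return (byte_offset, offending_byte) for the first failure.
    The whole file is read into memory, which is assumed to be acceptable
    for CSV files of moderate size.
    """
    with open(filepath, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, raw[exc.start]
    return None


# Example call (the path is a placeholder):
# problem = find_decode_failure("some_file.csv")
# if problem is not None:
#     offset, byte = problem
#     print("first undecodable byte 0x%02x at offset %d" % (byte, offset))
```

Files for which this returns a result are the ones to inspect with enca or file before choosing an encoding for read_csv.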