UnicodeDecodeError when reading CSV file in Pandas with Python
Problem description
I'm running a program that processes 30,000 similar files. A seemingly random subset of them stop and produce this error:
  File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
    data = pd.read_csv(filepath, names=fields)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
    return parser.read()
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
  File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
  File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte
These files all come from the same source. What's the best way to correct this so the import can proceed?
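Recommended answer
read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding = "ISO-8859-1"), or alternatively encoding = "utf-8", for reading, and generally utf-8 for to_csv.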
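For a batch like the 30,000 files in the question, one way to apply the encoding option is to try utf-8 first and fall back only when decoding fails. The following is a minimal sketch, not a definitive fix: the helper name read_csv_with_fallback and its encodings parameter are made up for illustration, and only pd.read_csv(..., encoding=...) is actual pandas API.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1"), **kwargs):
    """Return the first DataFrame that parses under one of `encodings`.

    Hypothetical helper for this sketch; extra keyword arguments
    (for example names=fields) are passed straight to pd.read_csv.
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc, **kwargs)
        except UnicodeDecodeError as err:
            # The file contains bytes that are invalid in this encoding;
            # remember the error and try the next candidate.
            last_error = err
    if last_error is None:
        raise ValueError("no encodings were given")
    raise last_error
```

Keep in mind that ISO-8859-1 assigns a character to every possible byte, so the fallback never raises. If the files are really in some other encoding, the import will succeed but some characters may come out wrong, so it is worth spot-checking a few of the files that needed the fallback.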
You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter).
See relevant Pandas documentation, python docs examples on csv files, and plenty of related questions here on SO. A good background resource is What every developer should know about unicode and character sets.
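To check the alias point for yourself, the standard-library codecs.lookup function can show which codec a given spelling resolves to. This is only a small sketch; the helper name same_codec is invented here.

```python
import codecs


def same_codec(label_a, label_b):
    """True if both encoding labels resolve to the same Python codec."""
    return codecs.lookup(label_a).name == codecs.lookup(label_b).name


# Print the canonical codec name Python uses for a few spellings.
for label in ("latin", "latin1", "ISO-8859-1", "utf-8"):
    print(label, "->", codecs.lookup(label).name)
```

codecs.lookup raises LookupError for a label Python does not know, which makes it a cheap way to validate an encoding name before passing it to read_csv.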
To detect the encoding (assuming the file contains non-ascii characters), you can use enca (see man page) or file -i (linux) or file -I (osx) (see man page).
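If neither enca nor file is available (on Windows, for instance), a rough stand-in is to try decoding the raw bytes with a short list of candidate encodings. Note that this is a different technique from what those tools do, and it only tells you which encodings are possible, not which one is correct. The function name guess_encoding and the candidate list are assumptions made for this sketch.

```python
def guess_encoding(raw, candidates=("ascii", "utf-8", "cp1252", "ISO-8859-1")):
    """Return the first candidate that decodes `raw` without error, else None.

    Order matters: strict encodings go first, because ISO-8859-1 accepts
    every byte sequence and would otherwise always win.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None
```

A typical use would be to read a file in binary mode, for example with open(path, "rb"), pass its bytes to guess_encoding, and hand the result to read_csv as the encoding argument.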