pandas reading .csv files
Problem description
I have a small script that uses pandas to read and print a .csv file generated by MS Excel.
import pandas as pd
data = pd.read_csv('./2010-11.csv')
print(data)
This script runs in Python 2.7.8, but in Python 3.4.1 it gives the following error. Any ideas why this might be so? Thanks in advance for any help with this.
Traceback (most recent call last):
  File "proc_csv_0-0.py", line 3, in <module>
    data = pd.read_csv('./2010-11.csv')
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 260, in _read
    return parser.read()
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 721, in read
    ret = self._engine.read(nrows)
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 1170, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7566)
  File "pandas/parser.pyx", line 791, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7806)
  File "pandas/parser.pyx", line 866, in pandas.parser.TextReader._read_rows (pandas/parser.c:8639)
  File "pandas/parser.pyx", line 973, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:9950)
  File "pandas/parser.pyx", line 1033, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:10737)
  File "pandas/parser.pyx", line 1130, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:12141)
  File "pandas/parser.pyx", line 1150, in pandas.parser.TextReader._string_convert (pandas/parser.c:12355)
  File "pandas/parser.pyx", line 1382, in pandas.parser._string_box_utf8 (pandas/parser.c:17679)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 4: unexpected end of data
Recommended answer
In Python 3, when pd.read_csv is passed a file path (as opposed to a file buffer), it decodes the contents with the utf-8 codec by default.[1] It appears your CSV file is using a different encoding. Since it was generated by MS Excel, it might be cp1252:
In [25]: print(b'\xc9'.decode('cp1252'))
É
In [27]: import unicodedata as UDAT
In [28]: UDAT.name(b'\xc9'.decode('cp1252'))
Out[28]: 'LATIN CAPITAL LETTER E WITH ACUTE'
The error message
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9
says that b'\xc9'.decode('utf-8') raises a UnicodeDecodeError.
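The failure can be reproduced without pandas. The sketch below is only an illustration; the helper name try_decode is made up for this example. In UTF-8, byte 0xc9 is the lead byte of a two-byte sequence, so on its own it is incomplete, and followed by an ASCII character it is malformed.

```python
def try_decode(raw, encoding="utf-8"):
    """Return (True, text) on success, or (False, reason) if decoding fails.

    Hypothetical helper for illustration; not part of pandas.
    """
    try:
        return True, raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return False, exc.reason


# 0xc9 alone: a lead byte with no continuation byte after it.
print(try_decode(b"\xc9"))
# 0xc9 followed by a comma: the next byte is not a continuation byte.
print(try_decode(b"\xc9,"))
# The actual UTF-8 encoding of É is the two bytes 0xc3 0x89.
print(try_decode(b"\xc3\x89"))
# The same single byte decodes cleanly as cp1252.
print(try_decode(b"\xc9", encoding="cp1252"))
```

The first case matches the "unexpected end of data" wording in the traceback above.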
The above shows that byte 0xc9 can be decoded with cp1252. It remains to be seen whether the rest of the file can also be decoded with cp1252, and whether that produces the desired result.
Unfortunately, given only a file, there is no surefire way to tell what encoding (if any) was used. It depends entirely on the program used to generate the file.
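One rough way to narrow things down is to try a few likely encodings and see which ones decode without raising. The sketch below assumes a hypothetical helper, encodings_that_decode, and an arbitrary candidate list; a clean decode only rules an encoding in as possible, it does not prove it is the right one.

```python
def encodings_that_decode(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the candidate encodings that decode `raw` without raising.

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    working = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes
        working.append(name)
    return working


# Sample bytes standing in for the raw contents of a CSV file.
sample = "CAFÉ,1\n".encode("cp1252")
print(encodings_that_decode(sample))

# Byte 0x81 is undefined in Python's cp1252 codec, so only latin-1 accepts it.
print(encodings_that_decode(b"\x81"))
```

Note that latin-1 accepts every byte sequence, so its presence in the result says nothing about whether the decoded text is what the author intended.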
If cp1252 is the right encoding, then to load the file into a DataFrame use
data = pd.read_csv('./2010-11.csv', encoding='cp1252')
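To check this behaviour end to end, the following sketch writes a tiny cp1252-encoded CSV to a temporary directory and reads it back both ways. The file name sample.csv, the column names, and the function roundtrip_cp1252 are all invented for the example, and it assumes a reasonably recent pandas is installed.

```python
import os
import tempfile

import pandas as pd


def roundtrip_cp1252():
    """Write a small cp1252 CSV, then read it with and without `encoding`."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for the Excel-generated file.
        path = os.path.join(tmp, "sample.csv")
        with open(path, "w", encoding="cp1252", newline="") as f:
            f.write("name,score\nCAFÉ,1\n")

        # With no encoding argument, pandas decodes as utf-8 and should fail.
        try:
            pd.read_csv(path)
            default_failed = False
        except UnicodeDecodeError:
            default_failed = True

        # Naming the encoding explicitly lets the file load.
        df = pd.read_csv(path, encoding="cp1252")
    return default_failed, df


default_failed, df = roundtrip_cp1252()
print(default_failed)
print(df)
```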
[1] When pd.read_csv is passed a buffer, the buffer could have been opened with encoding already set:
# Python3
with open('/tmp/test.csv', 'r', encoding='cp1252') as f:
    df = pd.read_csv(f)
    print(df)
in which case pd.read_csv will not attempt to decode since the buffer f is already supplying decoded strings.