pandas reading .csv files


Problem description

I have a small script that uses pandas to read and print a .csv file generated from MS Excel.

import pandas as pd
data = pd.read_csv('./2010-11.csv')
print(data)

This script runs in Python 2.7.8, but in Python 3.4.1 it gives the following error. Any ideas why this might be so? Thanks in advance for any help with this.

Traceback (most recent call last):
  File "proc_csv_0-0.py", line 3, in <module>
    data = pd.read_csv('./2010-11.csv')
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 260, in _read
    return parser.read()
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 721, in read
    ret = self._engine.read(nrows)
  File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 1170, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7566)
  File "pandas/parser.pyx", line 791, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7806)
  File "pandas/parser.pyx", line 866, in pandas.parser.TextReader._read_rows (pandas/parser.c:8639)
  File "pandas/parser.pyx", line 973, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:9950)
  File "pandas/parser.pyx", line 1033, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:10737)
  File "pandas/parser.pyx", line 1130, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:12141)
  File "pandas/parser.pyx", line 1150, in pandas.parser.TextReader._string_convert (pandas/parser.c:12355)
  File "pandas/parser.pyx", line 1382, in pandas.parser._string_box_utf8 (pandas/parser.c:17679)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 4: unexpected end of data

Recommended answer

In Python 3, when pd.read_csv is passed a file path (as opposed to a file buffer), it decodes the contents with the utf-8 codec by default.[1] It appears your CSV file is using a different encoding. Since it was generated by MS Excel, it might be cp1252:

In [25]: print(b'\xc9'.decode('cp1252'))
É

In [27]: import unicodedata as UDAT
In [28]: UDAT.name(b'\xc9'.decode('cp1252'))
Out[28]: 'LATIN CAPITAL LETTER E WITH ACUTE'

The error message

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9

says that b'\xc9'.decode('utf-8') raises a UnicodeDecodeError.

The above shows that byte 0xc9 can be decoded with cp1252. It remains to be seen whether the rest of the file can also be decoded with cp1252, and whether doing so produces the desired result.
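To see why utf-8 rejects this byte while cp1252 accepts it, here is a small sketch using only the standard library. The variable names are illustrative. In UTF-8, 0xC9 is the lead byte of a two-byte sequence, so on its own it is incomplete; in cp1252 it is a complete single-byte character.

```python
# 0xC9 is 0b11001001. In UTF-8 that bit pattern announces a two-byte
# sequence, so a continuation byte (0b10xxxxxx) must follow it.
raw = b"\xc9"

try:
    raw.decode("utf-8")
    utf8_ok = True
    reason = None
except UnicodeDecodeError as exc:
    utf8_ok = False
    reason = exc.reason  # the codec's explanation of the failure

# cp1252 is a single-byte encoding, so 0xC9 maps directly to U+00C9.
as_cp1252 = raw.decode("cp1252")

# The same character encoded as UTF-8 takes two bytes.
as_utf8_bytes = "\u00c9".encode("utf-8")

print(utf8_ok, reason, ascii(as_cp1252), as_utf8_bytes)
```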

Unfortunately, given only a file, there is no surefire way to tell what encoding (if any) was used. It depends entirely on the program used to generate the file.
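Since the encoding cannot be determined with certainty, one pragmatic approach is to try a short list of likely candidates and take the first that decodes without error. The sketch below is a heuristic, not a detector: the function name `first_working_encoding` and the candidate list are assumptions made for illustration, and a successful decode does not prove the encoding is the right one, so the decoded text still needs to be inspected.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first codec in `candidates` that decodes `raw`, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes; try the next one
        return name
    return None


# Illustrative input: "café" as Excel on Windows might have written it.
sample = "caf\u00e9".encode("cp1252")
print(first_working_encoding(sample))
```

For a real file, the bytes could be obtained with `open(path, 'rb').read()` and the returned name passed to `pd.read_csv` through its `encoding` argument. Order the candidates from strictest to most permissive, because a permissive single-byte codec accepts almost any input.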

If cp1252 is the right encoding, then to load the file into a DataFrame use

data = pd.read_csv('./2010-11.csv', encoding='cp1252') 
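The behaviour can be reproduced end to end without the original file. The following sketch writes a small cp1252-encoded CSV to a temporary directory, then reads it back both ways. The file name `sample.csv` and its contents are invented for illustration and are not taken from the question.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data containing non-ASCII characters (É and é).
csv_text = "name,city\n\u00c9mile,Orl\u00e9ans\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")

    # Write the file the way a cp1252-based exporter would.
    with open(path, "w", encoding="cp1252", newline="") as f:
        f.write(csv_text)

    # Reading with the default utf-8 codec fails on the 0xC9 byte.
    try:
        pd.read_csv(path)
        default_failed = False
    except UnicodeDecodeError:
        default_failed = True

    # Naming the encoding explicitly decodes the file correctly.
    df = pd.read_csv(path, encoding="cp1252")

print(default_failed)
print(df.shape)
```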


[1] When pd.read_csv is passed a buffer, the buffer could have been opened with the encoding already set:

# Python3
with open('/tmp/test.csv', 'r', encoding='cp1252') as f:
    df = pd.read_csv(f)
    print(df)

in which case pd.read_csv will not attempt to decode, since the buffer f is already supplying decoded strings.
