Reading parts of ~13000 row CSV file with pandas read_csv and nrows


Problem description


I'm trying to read segments of a CSV file into a pandas DataFrame, and I'm running into trouble when I set nrows to more than a certain point. My CSV file is split up into different segments with different headers/types of data, so I've gone through the file and found the line numbers of the different segments, and saved the line numbers. When I try to do:

pd.io.parsers.read_csv('filename',skiprows=40, nrows=12646)

It works fine. Any more rows, and it throws an error:

CParserError: Error tokenizing data. C error: Expected 56 fields in line 13897, saw 71

It's true that line 13897 has that many rows, that's why I'm trying to use nrows and skiprows. I can find the last row that pandas will read and it doesn't look any different from the rest. Looking at the file in a hex editor I still don't see any difference.

I've also tried it with another CSV file, and I get similar results:

pd.io.parsers.read_csv('file2',skiprows=112, nrows=18524)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18188 entries, 0 to 18187

But:

pd.io.parsers.read_csv('file2',skiprows=112, nrows=18525)

gives:

CParserError: Error tokenizing data. C error: Expected 56 fields in line 19190, saw 71

Is there something I'm missing? Is there another way to do this?
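The segment-by-segment approach described above can be sketched as follows. The file contents and segment offsets here are invented for illustration; in current pandas versions, `nrows` correctly stops the parser before it reaches a segment with a different field count:

```python
import pandas as pd
from io import StringIO

# Hypothetical file with two segments that have different headers and widths
raw = StringIO("""x,y
1,2
3,4
a,b,c
5,6,7
8,9,10""")

# (start_line, n_rows) pairs, found by scanning the file beforehand
segments = [(0, 2), (3, 2)]

frames = []
for start, n in segments:
    raw.seek(0)  # rewind the buffer; with a real file, pass the path instead
    # skiprows jumps to the segment's header line; nrows limits how far we parse
    frames.append(pd.read_csv(raw, skiprows=start, nrows=n))
```

Each entry in `frames` is an independent DataFrame with that segment's own header.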

I'm using: pandas-0.10.1.win-amd64-py3.3, numpy-MKL-1.7.1rc1.win-amd64-py3.3, and python-3.3.0.amd64 on Windows. I get the same issue with numpy-unoptimized-1.7.1rc1.win-amd64-py3.3.

Solution

You can use warn_bad_lines and error_bad_lines to turn off the bad-line errors and warnings:

import pandas as pd
from io import StringIO  # the question uses Python 3.3; the Python 2 form was "from StringIO import StringIO"

data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")

# The 4-field line ("6,7,8,9") is silently dropped instead of raising CParserError
pd.read_csv(data, warn_bad_lines=False, error_bad_lines=False)
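A note for readers on current pandas: warn_bad_lines and error_bad_lines were deprecated in pandas 1.3 and removed in 2.0. The replacement is the on_bad_lines parameter. A minimal sketch, assuming pandas >= 1.3:

```python
import pandas as pd
from io import StringIO

data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")

# on_bad_lines="skip" drops rows whose field count doesn't match the header
df = pd.read_csv(data, on_bad_lines="skip")
```

The resulting DataFrame contains the four well-formed rows; the line `6,7,8,9` is discarded.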
