Reading parts of ~13000 row CSV file with pandas read_csv and nrows
Problem description
I'm trying to read segments of a CSV file into a pandas DataFrame, and I'm running into trouble when I set nrows to more than a certain point. My CSV file is split up into different segments with different headers/types of data, so I've gone through the file and found the line numbers of the different segments, and saved the line numbers. When I try to do:
pd.io.parsers.read_csv('filename',skiprows=40, nrows=12646)
It works fine. Any more rows, and it throws an error:
CParserError: Error tokenizing data. C error: Expected 56 fields in line 13897, saw 71
It's true that line 13897 has that many fields; that's why I'm trying to use nrows and skiprows. I can find the last row that pandas will read, and it doesn't look any different from the rest. Looking at the file in a hex editor, I still don't see any difference.
I've also tried it with another CSV file, and I get similar results:
pd.io.parsers.read_csv('file2',skiprows=112, nrows=18524)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18188 entries, 0 to 18187
But:
pd.io.parsers.read_csv('file2',skiprows=112, nrows=18525)
gives:
CParserError: Error tokenizing data. C error: Expected 56 fields in line 19190, saw 71
Is there something I'm missing? Is there another way to do this?
I'm using pandas-0.10.1.win-amd64-py3.3, numpy-MKL-1.7.1rc1.win-amd64-py3.3, and python-3.3.0.amd64 on Windows. I get the same issue with numpy-unoptimized-1.7.1rc1.win-amd64-py3.3.
Solution
You can use warn_bad_lines and error_bad_lines to turn off the bad-line errors and warnings:
import pandas as pd
from io import StringIO  # Python 3; the Python 2 spelling was: from StringIO import StringIO
# The fourth data line has four fields instead of three, so it counts as a "bad line"
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")
pd.read_csv(data, warn_bad_lines=False, error_bad_lines=False)
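For readers on a current pandas: warn_bad_lines and error_bad_lines were deprecated in pandas 1.3 and removed in 2.0, where the on_bad_lines parameter takes their place. The sketch below shows that replacement on the same sample data, and also one possible alternative for the segmented-file case: slicing out just the wanted lines with itertools.islice before pandas sees them, so that rows from the next segment never reach the tokenizer. The helper name read_segment and the sample text raw are hypothetical and only serve the illustration; this assumes pandas 1.3 or later.

```python
from io import StringIO
from itertools import islice

import pandas as pd

# Same sample as above: the line "6,7,8,9" has one field too many.
data = StringIO("""a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5""")

# on_bad_lines="skip" silently drops lines with too many fields (pandas >= 1.3).
cleaned = pd.read_csv(data, on_bad_lines="skip")


def read_segment(lines, start, count, **kwargs):
    """Hypothetical helper: parse `count` lines beginning at 0-based line
    `start` (the segment's header line included) and ignore everything else."""
    chunk = "".join(islice(lines, start, start + count))
    return pd.read_csv(StringIO(chunk), **kwargs)


# Made-up file with a junk line, a 3-column segment, then a 5-column segment.
raw = "junk\na,b,c\n1,2,3\n4,5,6\nx,y,z,w,v\n1,2,3,4,5\n"
segment = read_segment(StringIO(raw), start=1, count=3)

print(cleaned)
print(segment)
```

With a real file, an open file object can be passed as `lines`, since iterating over it yields one line at a time. Note that the two approaches differ: skipping bad lines discards data, while slicing keeps each segment intact so it can be parsed with its own header.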