Way to read first few lines for pandas dataframe
Question
Is there a built-in way to use read_csv to read only the first n lines of a file without knowing the length of the lines ahead of time? I have a large file that takes a long time to read, and occasionally I only want the first, say, 20 lines to get a sample of it (and would prefer not to load the whole file and take the head of it).
If I knew the total number of lines, I could compute footer_lines = total_lines - n and pass that to the skipfooter keyword argument. My current solution is to grab the first n lines manually with Python and feed them to pandas through StringIO:
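import pandas as pd
from io import StringIO  # Python 3; the standalone StringIO module is Python 2 only
from itertools import islice

n = 20
with open('big_file.csv', 'r') as f:
    # islice takes the first n lines; readlines(n) treats n as a byte-size hint
    head = ''.join(islice(f, n))
df = pd.read_csv(StringIO(head))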
It's not that bad, but is there a more concise, 'pandasic' (?) way to do it with keywords or something?
Answer
I think you can use the nrows parameter. From the docs:
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
This seems to work. Using one of the standard large test files (988504479 bytes, 5344499 lines):
In [1]: import pandas as pd
In [2]: time z = pd.read_csv("P00000001-ALL.csv", nrows=20)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
In [3]: len(z)
Out[3]: 20
In [4]: time z = pd.read_csv("P00000001-ALL.csv")
CPU times: user 27.63 s, sys: 1.92 s, total: 29.55 s
Wall time: 30.23 s
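For anyone who wants to try nrows without a large file at hand, here is a minimal sketch. The in-memory CSV, its column names (id, value), and the variable names are made up for illustration and are not from the question; only read_csv and its nrows argument come from the answer above.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk:
# one header line followed by 1000 data rows.
csv_text = "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(1000))

# nrows limits the number of data rows parsed; the header line is
# handled separately and does not count toward the 20.
sample = pd.read_csv(io.StringIO(csv_text), nrows=20)

print(sample.shape)
print(sample.tail(3))
```

Note that nrows counts data rows, whereas the manual approach in the question takes the first n lines of the file, header included, so the two can differ by one row.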