Way to read first few lines for pandas dataframe


Question

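Is there a built-in way to use read_csv to read only the first n lines of a file without knowing the length of the lines ahead of time? I have a large file that takes a long time to read, and occasionally I only want the first, say, 20 lines as a sample (and I would prefer not to load the whole file and then take its head).

If I knew the total number of lines, I could compute footer_lines = total_lines - n and pass that to the skipfooter keyword argument. My current solution is to grab the first n lines manually with Python and feed them to pandas through StringIO: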

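import pandas as pd
from io import StringIO  # Python 3 location; Python 2 used the StringIO module
from itertools import islice

n = 20
with open('big_file.csv', 'r') as f:
    # readlines(n) treats n as a size hint, not a line count,
    # so take the first n lines (header included) explicitly.
    head = ''.join(islice(f, n))

df = pd.read_csv(StringIO(head))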

It's not that bad, but is there a more concise, 'pandasic' (?) way to do it with keywords or something?

Solution

I think you can use the nrows parameter. From the docs:

nrows : int, default None

    Number of rows of file to read. Useful for reading pieces of large files

This seems to work. Using one of the standard large test files (988504479 bytes, 5344499 lines):

In [1]: import pandas as pd

In [2]: time z = pd.read_csv("P00000001-ALL.csv", nrows=20)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s

In [3]: len(z)
Out[3]: 20

In [4]: time z = pd.read_csv("P00000001-ALL.csv")
CPU times: user 27.63 s, sys: 1.92 s, total: 29.55 s
Wall time: 30.23 s
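
For anyone who wants to try this without a large file at hand, here is a minimal sketch of the same nrows call. The in-memory CSV and its column names a and b are made up for illustration and stand in for the real file; the point is only that nrows counts data rows, so the header line is not one of the 20.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file: a header line plus 1000 data rows.
csv_text = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))

# nrows limits the number of data rows parsed; the header is read separately.
head = pd.read_csv(io.StringIO(csv_text), nrows=20)

print(len(head))           # 20
print(list(head.columns))  # ['a', 'b']
print(head["a"].iloc[-1])  # 19
```

The same keyword works when the first argument is a file path, as in the timing session above.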
