Pandas, large data, HDF tables and memory usage when calling a function


Problem description

Short question

When Pandas works on an HDFStore (e.g. .mean() or .apply()), does it load the full data into memory as a DataFrame, or does it process it record by record as a Series?

Long description

I have to work on large data files, and I can specify the output format of the data file.

I intend to use Pandas to process the data, and I would like to set up the best format so that it maximizes performance.

I have seen that pandas.read_table() has come a long way, but it still takes at least as much memory (in fact at least twice as much) as the size of the original file we want to read and transform into a DataFrame. This may work for files up to 1 GB, but beyond that? It can be hard, especially on shared online machines.

However, I have seen that Pandas now seems to support HDF tables through PyTables.
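As a rough illustration of that HDF path (a minimal sketch only; the file name data.h5, the key "readings" and the column names are placeholders, not anything from the original question), a DataFrame can be written to an HDFStore in "table" format and read back:

import numpy as np
import pandas as pd

# Small stand-in for the large data file.
df = pd.DataFrame({
    "sensor": np.random.randint(0, 10, size=1000),
    "value": np.random.randn(1000),
})

# "table" format is the queryable, appendable layout backed by PyTables;
# the default "fixed" format is faster to write but cannot be queried.
df.to_hdf("data.h5", key="readings", format="table", mode="w")

# Reading it back this way still loads the whole table into memory.
full = pd.read_hdf("data.h5", "readings")
print(full["value"].mean())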

My question is: how does Pandas manage memory when we run an operation on a whole HDF table, for example a .mean() or an .apply()? Does it first load the entire table into a DataFrame, or does it apply the function to the data directly from the HDF file without keeping it all in memory?

Side question: is the HDF5 format compact in terms of disk usage? I mean, is it verbose like XML, or more like JSON? (I know there are indexes and such, but here I am interested in the bare description of the data.)

Solution

I think I have found the answer: yes and no, it depends on how you load your Pandas DataFrame.

As with the read_table() method, there is an "iterator" argument that lets you get a generator-like object which fetches only a limited number of records at a time, as explained here: http://pandas.pydata.org/pandas-docs/dev/io.html#iterator
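For illustration, here is what the chunked style of read_table looks like in practice (a sketch under assumptions: big_file.tsv and the value column are made-up names, and the chunk size is arbitrary):

import pandas as pd

# Hypothetical tab-separated file and column name, for illustration only.
reader = pd.read_table("big_file.tsv", chunksize=100_000)

# Each iteration yields a DataFrame of at most 100,000 rows, so peak
# memory stays around one chunk's worth instead of the whole file.
total = 0.0
count = 0
for chunk in reader:
    total += chunk["value"].sum()
    count += len(chunk)

print("mean:", total / count)

Passing iterator=True instead of chunksize gives a similar reader object from which chunks can be pulled on demand.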

Now, I don't know how functions like .mean() and .apply() would work with these generators.

If someone has more info/experience, feel free to share!
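In the meantime, one hedged way to get a mean without loading everything (assuming the data was stored in "table" format under the key "readings", as in the earlier sketch) is to stream the HDF table in chunks via HDFStore.select and combine partial results yourself:

import pandas as pd

# Neither .mean() nor .apply() will consume the chunk iterator for you,
# so aggregate chunk by chunk and combine the partial results.
total = 0.0
count = 0
with pd.HDFStore("data.h5", mode="r") as store:
    # select() with chunksize yields DataFrames of bounded size
    # rather than materializing the whole table at once.
    for chunk in store.select("readings", chunksize=50_000):
        total += chunk["value"].sum()
        count += len(chunk)

print("streamed mean:", total / count)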

About HDF5 overhead:

HDF5 keeps a B-tree in memory that is used to map chunk structures on disk. The more chunks that are allocated for a dataset the larger the B-tree. Large B-trees take memory and cause file storage overhead as well as more disk I/O and higher contention for the metadata cache. Consequently, it's important to balance between memory and I/O overhead (small B-trees) and time to access data (big B-trees).

http://pytables.github.com/usersguide/optimization.html
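Regarding that disk/B-tree tradeoff, pandas exposes a couple of relevant knobs when writing an HDFStore (again a sketch; the file name, key and row count are illustrative, and the actual gains depend on the data):

import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.random.randn(1_000_000)})

# Compression keeps the on-disk footprint down, and expectedrows hints
# the final table size so PyTables can pick larger chunks up front
# (fewer chunks -> smaller B-tree and less metadata overhead).
with pd.HDFStore("compact.h5", mode="w", complevel=9, complib="blosc") as store:
    store.append("readings", df, format="table", expectedrows=1_000_000)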
