Pandas "Group By" Query on Large Data in HDFStore?


Question

I have about 7 million rows in an HDFStore with more than 60 columns. The data is more than I can fit into memory. I'm looking to aggregate the data into groups based on the value of column "A". The documentation for pandas splitting/aggregating/combining assumes that I have all my data in a DataFrame already; however, I can't read the entire store into an in-memory DataFrame. What is the correct approach for grouping data in an HDFStore?

Answer

Here's a complete example.

import numpy as np
import pandas as pd
import os

fname = 'groupby.h5'

# create a frame
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar',
                         'foo', 'foo', 'foo'],
                   'B': ['one', 'one', 'one', 'two',
                         'one', 'one', 'one', 'two',
                         'two', 'two', 'one'],
                   'C': ['dull', 'dull', 'shiny', 'dull',
                         'dull', 'shiny', 'shiny', 'dull',
                         'shiny', 'shiny', 'shiny'],
                   'D': np.random.randn(11),
                   'E': np.random.randn(11),
                   'F': np.random.randn(11)})


# create the store and append, using data_columns on the columns
# I could possibly aggregate over
with pd.HDFStore(fname) as store:
    store.append('df', df, data_columns=['A', 'B', 'C'])
    print("store:\n%s" % store)

    print("\ndf:\n%s" % store['df'])

    # get the unique values of the grouping column
    groups = store.select_column('df', 'A').unique()
    print("\ngroups:%s" % groups)

    # iterate over the groups and apply my operations
    results = []
    for g in groups:

        grp = store.select('df', where='A == %r' % g)

        # this is a regular frame, aggregate however you would like
        results.append(grp[['D', 'E', 'F']].sum())

    print("\nresult:\n%s" % pd.concat(results, keys=groups))

os.remove(fname)

Output

store:
<class 'pandas.io.pytables.HDFStore'>
File path: groupby.h5
/df            frame_table  (typ->appendable,nrows->11,ncols->6,indexers->[index],dc->[A,B,C])

df:
      A    B      C         D         E         F
0   foo  one   dull -0.815212 -1.195488 -1.346980
1   foo  one   dull -1.111686 -1.814385 -0.974327
2   foo  one  shiny -1.069152 -1.926265  0.360318
3   foo  two   dull -0.472180  0.698369 -1.007010
4   bar  one   dull  1.329867  0.709621  1.877898
5   bar  one  shiny -0.962906  0.489594 -0.663068
6   bar  one  shiny -0.657922 -0.377705  0.065790
7   bar  two   dull -0.172245  1.694245  1.374189
8   foo  two  shiny -0.780877 -2.334895 -2.747404
9   foo  two  shiny -0.257413  0.577804 -0.159316
10  foo  one  shiny  0.737597  1.979373 -0.236070

groups:Index([bar, foo], dtype=object)

result:
bar  D   -0.463206
     E    2.515754
     F    2.654810
foo  D   -3.768923
     E   -4.015488
     F   -6.110789
dtype: float64

Some caveats:

1) This methodology makes sense if your group density is relatively low, on the order of hundreds or thousands of groups. If you have more than that, there are more efficient (but more complicated) methods, and the function you are applying (in this case sum) becomes more restrictive.

Essentially you would iterate over the entire store in chunks, grouping as you go, but keeping the groups only semi-collapsed (imagine doing a mean: you would need to keep a running total plus a running count, then divide at the end). So some operations would be a bit trickier, but this could potentially handle MANY groups (and is really fast).
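
A minimal sketch of that chunked approach, assuming the 'df' table and columns built in the example above (before the os.remove call); the chunksize value is arbitrary:

import pandas as pd

fname = 'groupby.h5'

# stream the table in chunks, keeping per-group running sums and
# counts, then divide at the end to get the per-group mean
sums, counts = None, None
with pd.HDFStore(fname) as store:
    for chunk in store.select('df', chunksize=50000):
        s = chunk.groupby('A')[['D', 'E', 'F']].sum()
        c = chunk.groupby('A').size()
        sums = s if sums is None else sums.add(s, fill_value=0)
        counts = c if counts is None else counts.add(c, fill_value=0)

# running total / running count
means = sums.div(counts, axis=0)
print(means)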

2) The efficiency of this could be improved by saving the coordinates (e.g. the group locations), but this is a bit more complicated.
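
One way to sketch that idea uses select_as_coordinates, which returns the row coordinates matching a query so they can be saved and reused instead of re-evaluating the where clause each time (the dict-based caching arrangement here is just one possibility, not a prescribed pattern):

import pandas as pd

fname = 'groupby.h5'

with pd.HDFStore(fname) as store:
    groups = store.select_column('df', 'A').unique()
    # compute each group's row coordinates once...
    coords = {g: store.select_as_coordinates('df', 'A == %r' % g)
              for g in groups}
    # ...then select by coordinates rather than re-parsing a where clause
    result = {g: store.select('df', where=c)[['D', 'E', 'F']].sum()
              for g, c in coords.items()}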

3) Multi-grouping is not possible with this scheme (it IS possible, but requires an approach more like 2) above).

4) The columns that you want to group on MUST be data_columns!

5) You can combine any other filter you wish in the select, by the way (which is a sneaky way of doing multi-grouping: you just form two unique lists of group values and iterate over their product; not extremely efficient if you have lots of groups, but it can work). See the sketch below.
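
A sketch of that product trick, grouping on the two data_columns 'A' and 'B' from the example above:

from itertools import product

import pandas as pd

fname = 'groupby.h5'

# form the unique values of each grouping column, iterate over their
# product, and combine both filters in a single where clause
with pd.HDFStore(fname) as store:
    a_vals = store.select_column('df', 'A').unique()
    b_vals = store.select_column('df', 'B').unique()
    pieces, keys = [], []
    for a, b in product(a_vals, b_vals):
        grp = store.select('df', where='A == %r & B == %r' % (a, b))
        if len(grp):  # some combinations may be empty
            pieces.append(grp[['D', 'E', 'F']].sum())
            keys.append((a, b))
    result = pd.concat(pieces, keys=keys)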

HTH

Let me know if this works for you.
