Python Pandas: Using Aggregate vs Apply to define new columns


Question

Suppose I have a dataframe like so:

import numpy as np
import pandas as pd

n = 20
dim1 = np.random.randint(1, 3, size=n)
dim2 = np.random.randint(3, 5, size=n)
data1 = np.random.randint(10, 20, size=n)
data2 = np.random.randint(1, 10, size=n)
df = pd.DataFrame({'a': dim1, 'b': dim2, 'val1': data1, 'val2': data2})

If I define a function that computes a group-wise ratio:

def h(x):
    # Guard against division by zero, then return the ratio of the column sums
    if x['val2'].sum() == 0:
        return 0
    else:
        return x['val1'].sum() * 1.0 / x['val2'].sum()

Grouping by one of the columns and aggregating returns a result:

df.groupby(['a']).aggregate(h)['val1']

although it converts all the existing columns to the desired result rather than adding a new column.

Grouping by two columns leads to an error when using aggregate:

df.groupby(['a','b']).aggregate(h)['val1']

KeyError: 'val2'

But switching aggregate for apply seems to work.
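For illustration, a minimal sketch of the apply call that works (the resulting values will vary, since the data is random):

    # apply passes each group as a full sub-DataFrame, so h can see both columns
    result = df.groupby(['a', 'b']).apply(h)
    # result is a Series indexed by (a, b), with one ratio per group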

I have two questions:

  1. Why does apply work, but aggregate doesn't?
  2. If, after grouping a dataframe by a set of keys, I want to use a function that aggregates the group values into a new column, what is the best way to do it?

Thanks.

Answer

To step back slightly, a faster way to do this particular "aggregation" is to just use sum (it's optimised in cython) a couple of times.

In [11]: %timeit g.apply(h)
1000 loops, best of 3: 1.79 ms per loop

In [12]: %timeit g['val1'].sum() / g['val2'].sum()
1000 loops, best of 3: 600 µs per loop
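For reference, a minimal sketch of the computation being timed above; the definition of g is assumed (it is not shown in the original session):

    # assumed definition of the groupby object used in the timings
    g = df.groupby(['a', 'b'])

    # column-wise sums are cythonised, so this is faster than calling h once per group
    ratio = g['val1'].sum() / g['val2'].sum()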

IMO the groupby code is pretty hairy, so I usually lazily "black box" peek at what's going on by building a list of the values it sees:

a = []  # collects whatever pandas passes in, so we can inspect it afterwards

def h1(x):
    a.append(x)
    return h(x)
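To populate a, one would run the failing aggregation with h1 in place of h; a sketch (not shown in the original):

    # the aggregation still raises the KeyError from the question,
    # but `a` records what each call to h1 received
    try:
        df.groupby(['a', 'b']).aggregate(h1)
    except KeyError:
        pass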

A warning: sometimes the types in this list aren't consistent (in which case pandas tries a few different things before doing any calculation)...

The second aggregation applies h to each column separately, so the group it gets stuck on (the one that raises the error) is:

0     10
4     16
8     13
9     17
17    17
19    11
Name: val1, dtype: int64

That is, the sub-Series of the val1 column where (a, b) = (1, 3).

This may well be a bug; after this raises, perhaps it could try something else (my suspicion is that this is why the first version works, as it's special-cased)...

For those interested, the a I get is:

In [21]: a
Out[21]: 
[SNDArray([125755456, 131767536,        13,        17,        17,        11]),
 Series([], name: val1, dtype: int64),
 0     10
4     16
8     13
9     17
17    17
19    11
Name: val1, dtype: int64]

I have no idea what the SNDArray is about...
