Apply vs transform on a group object


Question

Consider the following dataframe:

import pandas as pd

columns = ['A', 'B', 'C', 'D']
records = [
    ['foo', 'one', 0.162003, 0.087469],
    ['bar', 'one', -1.156319, -1.5262719999999999],
    ['foo', 'two', 0.833892, -1.666304],     
    ['bar', 'three', -2.026673, -0.32205700000000004],
    ['foo', 'two', 0.41145200000000004, -0.9543709999999999],
    ['bar', 'two', 0.765878, -0.095968],
    ['foo', 'one', -0.65489, 0.678091],
    ['foo', 'three', -1.789842, -1.130922]
]
df = pd.DataFrame.from_records(records, columns=columns)

"""
     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922
"""

The following commands work:

df.groupby('A').apply(lambda x: (x['C'] - x['D']))
df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

but none of the following work:

df.groupby('A').transform(lambda x: (x['C'] - x['D']))
# KeyError or ValueError: could not broadcast input array from shape (5) into shape (5,3)

df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
# KeyError or TypeError: cannot concatenate a non-NDFrame object

Why? The example in the documentation seems to suggest that calling transform on a group allows row-wise processing:

# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, the original dataframe above was constructed as follows (randn here is numpy.random.randn):

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C' : randn(8), 'D' : randn(8)})

Solution

Two major differences between apply and transform

There are two major differences between the transform and apply groupby methods.

  • Input:
      • apply implicitly passes all the columns for each group as a DataFrame to the custom function,
      • while transform passes each column for each group individually as a Series to the custom function.
  • Output:
      • The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
      • The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
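The input difference can be checked without reading error tracebacks. The following is a minimal sketch, not part of the original answer: the helper names record_apply and record_transform are made up for illustration, and the value columns are selected explicitly so that the grouping column is not handed to the functions. Some pandas versions may also probe the transform function with the whole group as an internal optimization, so the sketch only looks at the first object transform receives.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

seen_by_apply = []      # type names received by the function given to apply
seen_by_transform = []  # type names received by the function given to transform

def record_apply(group):
    # Hypothetical helper: remember what apply handed us.
    seen_by_apply.append(type(group).__name__)
    return group['a'].sum()  # a scalar per group is a valid apply result

def record_transform(values):
    # Hypothetical helper: remember what transform handed us.
    seen_by_transform.append(type(values).__name__)
    return values  # same length as the input, so valid for transform

grouped = df.groupby('State')[['a', 'b']]
grouped.apply(record_apply)
grouped.transform(record_transform)

print(seen_by_apply)          # every entry is 'DataFrame'
print(seen_by_transform[0])   # the first call receives a 'Series'
```

Collecting the type names in a list avoids having to raise an exception just to stop execution.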

Inspecting the custom function

It can help quite a bit to inspect the input to your custom function passed to apply or transform.

Examples

Let's create some sample data and inspect the groups so that you can see what I am talking about:

import pandas as pd
import numpy as np
df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'], 
                   'a':[4,5,1,3], 'b':[6,10,3,11]})

     State  a   b
0    Texas  4   6
1    Texas  5  10
2  Florida  1   3
3  Florida  3  11

Let's create a simple custom function that prints out the type of the implicitly passed object and then raises an error so that execution stops.

def inspect(x):
    print(type(x))
    raise

Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:

df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError

As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. In the pandas version used here, the first group is run twice to determine whether there is a fast way to complete the computation; newer releases may evaluate it only once. This is a minor detail that you shouldn't worry about.

Now, let's do the same thing with transform:

df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError

It is passed a Series - a totally different Pandas object.

So, transform is only allowed to work with a single Series at a time. It cannot act on two columns at the same time. If we try to subtract column b from column a inside our custom function, we get an error with transform. See below:

def subtract_two(x):
    return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')

We get a KeyError because pandas looks for the label a in the index of the Series, where it does not exist. You can complete this operation with apply, as it has the entire DataFrame:

df.groupby('State').apply(subtract_two)

State     
Florida  2   -2
         3   -8
Texas    0   -2
         1   -5
dtype: int64

The output is a Series and a little confusing as the original index is kept, but we have access to all columns.
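If what you want is a transform-style result (one value per original row, in the original order) that depends on two columns, one option is to combine the columns first and group afterwards. This is a sketch under that assumption, not something the original answer proposes; the names diff and group_mean_diff are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# The row-wise difference needs no grouping at all.
diff = df['a'] - df['b']

# Group the resulting Series by the State column and broadcast each
# group's mean back to the rows of that group.
group_mean_diff = diff.groupby(df['State']).transform('mean')

print(group_mean_diff)
```

The result keeps the original index 0..3, so it can be assigned straight back to the DataFrame as a new column.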


Displaying the passed pandas object

It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames are rendered nicely as HTML in a Jupyter notebook:

from IPython.display import display
def subtract_two(x):
    display(x)
    return x['a'] - x['b']

[Screenshot of the notebook output omitted]


Transform must return a single dimensional sequence the same size as the group

The other difference is that transform must return a one-dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not, an error is raised:

def return_three(x):
    return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group

The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:

def rand_group_len(x):
    return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

          a         b
0  0.962070  0.151440
1  0.440956  0.782176
2  0.642218  0.483257
3  0.056047  0.238208
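A deterministic function of the same kind is group-wise demeaning, which is also what the zscore example from the question relies on: x - x.mean() returns one value per row of the group, so its length always matches. A small sketch (the name demeaned is illustrative, and the value columns are selected explicitly):

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# For each column within each group, subtract that group's mean.
# The result has the same length as the group, as transform requires.
demeaned = df.groupby('State')[['a', 'b']].transform(lambda x: x - x.mean())

print(demeaned)
#      a    b
# 0 -0.5 -2.0
# 1  0.5  2.0
# 2 -1.0 -4.0
# 3  1.0  4.0
```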


Returning a single scalar object also works for transform

If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:

def group_sum(x):
    return x.sum()

df.groupby('State').transform(group_sum)

   a   b
0  9  16
1  9  16
2  4  14
3  4  14
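Because the scalar is repeated for every row of its group, the result lines up with the original rows, which makes it convenient for adding group-level values as new columns. A short sketch, with a_total and a_share as illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# 'sum' yields one scalar per group; transform repeats it on every
# row of that group, so it can be stored next to the original data.
df['a_total'] = df.groupby('State')['a'].transform('sum')

# Each row's share of its own group's total.
df['a_share'] = df['a'] / df['a_total']

print(df)
```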
