Apply vs transform on a group object


Question

Consider the following dataframe:

     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922

The following commands work:

> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

But none of the following work:

> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)

> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
 TypeError: cannot concatenate a non-NDFrame object

Why? The example in the documentation seems to suggest that calling transform on a group allows one to do row-wise operation processing:

# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8), 'D': np.random.randn(8)})

Recommended answer

Two major differences between apply and transform

There are two major differences between the transform and apply groupby methods.

  • Input:
    • apply implicitly passes all the columns for each group as a DataFrame to the custom function,
    • while transform passes each column for each group individually as a Series to the custom function.
  • Output:
    • The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
    • The custom function passed to transform must return a sequence (a one-dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.

It can help quite a bit to inspect the input to your custom function passed to apply or transform.

Let's create some sample data and inspect the groups so that you can see what I am talking about:

      import pandas as pd
      df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'], 
                         'a':[4,5,1,3], 'b':[6,10,3,11]})
      
           State  a   b
      0    Texas  4   6
      1    Texas  5  10
      2  Florida  1   3
      3  Florida  3  11
      

Let's create a simple custom function that prints out the type of the implicitly passed object and then raises an error so that execution stops.

      def inspect(x):
          print(type(x))
          raise
      

Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:

      df.groupby('State').apply(inspect)
      
      <class 'pandas.core.frame.DataFrame'>
      <class 'pandas.core.frame.DataFrame'>
      RuntimeError
      

As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. In the pandas version used here, the function is run twice on the first group, to determine whether there is a fast way to complete the computation. This is a minor detail that you shouldn't worry about.

Now, let's do the same thing with transform:

      df.groupby('State').transform(inspect)
      <class 'pandas.core.series.Series'>
      <class 'pandas.core.series.Series'>
      RuntimeError
      

It is passed a Series - a totally different pandas object.

So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try to subtract column b from column a inside our custom function, we get an error with transform. See below:

      def subtract_two(x):
          return x['a'] - x['b']
      
      df.groupby('State').transform(subtract_two)
      KeyError: ('a', 'occurred at index a')
      

We get a KeyError because pandas looks for the label a in the index of the Series it was passed, and that label does not exist. You can complete this operation with apply, as it has the entire DataFrame:

      df.groupby('State').apply(subtract_two)
      
      State     
      Florida  2   -2
               3   -8
      Texas    0   -2
               1   -5
      dtype: int64
      

The output is a Series and is a little confusing because the original index is kept, but we have access to all columns.

It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames are rendered nicely as HTML in a Jupyter notebook:

      from IPython.display import display
      def subtract_two(x):
          display(x)
          return x['a'] - x['b']
      


The other difference is that transform must return a one-dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not, an error is raised:

      import numpy as np

      def return_three(x):
          return np.array([1, 2, 3])
      
      df.groupby('State').transform(return_three)
      ValueError: transform must return a scalar value for each group
      

The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:

      def rand_group_len(x):
          return np.random.rand(len(x))
      
      df.groupby('State').transform(rand_group_len)
      
                a         b
      0  0.962070  0.151440
      1  0.440956  0.782176
      2  0.642218  0.483257
      3  0.056047  0.238208
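A less artificial function of the same shape is within-group centering, similar in spirit to the zscore example quoted in the question. This is a minimal sketch on the same sample data; the function name `demean` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

def demean(x):
    # x is a single column of a single group (a Series);
    # subtracting its mean returns a Series of the same length.
    return x - x.mean()

centered = df.groupby('State').transform(demean)
print(centered)
```

Each value is replaced by its distance from the mean of its own State group, column by column.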
      


Returning a single scalar object also works for transform

If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:

      def group_sum(x):
          return x.sum()
      
      df.groupby('State').transform(group_sum)
      
         a   b
      0  9  16
      1  9  16
      2  4  14
      3  4  14
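The scalar behaviour also suggests one way to handle the two-column case from the question with transform: combine the columns before grouping, then broadcast a per-group statistic. This is a sketch under that assumption, not the only approach. The column name `mean_diff` is made up for illustration, and the string `'mean'` is passed to transform in place of a custom function.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# Combine the two columns first, outside of groupby ...
diff = df['a'] - df['b']

# ... then group that single Series by State and broadcast each
# group's mean back to every row of the group.
df['mean_diff'] = diff.groupby(df['State']).transform('mean')  # hypothetical column name
print(df)
```

Since transform now receives only one Series per group, the restriction described above no longer applies.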
      
