Apply vs transform on a group object
Question
Consider the following dataframe:
     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922
The following commands work:
> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())
But none of the following work:
> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)
> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
TypeError: cannot concatenate a non-NDFrame object
Why? The example in the documentation seems to suggest that calling transform on a group allows one to do row-wise operation processing:
# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)
In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?
For reference, below is the construction of the original dataframe above:
import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': randn(8), 'D': randn(8)})
Recommended answer
Two major differences between apply and transform
There are two major differences between the transform and apply groupby methods.
- Input:
  - apply implicitly passes all the columns for each group as a DataFrame to the custom function,
  - while transform passes each column for each group individually as a Series to the custom function.
- Output:
  - The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
  - The custom function passed to transform must return a sequence (a one-dimensional Series, array or list) the same length as the group.
So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
It can help quite a bit to inspect the input to your custom function passed to apply or transform.
Let's create some sample data and inspect the groups so that you can see what I am talking about:
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

     State  a   b
0    Texas  4   6
1    Texas  5  10
2  Florida  1   3
3  Florida  3  11
Let's create a simple custom function that prints out the type of the implicitly passed object and then raises an error so that execution can be stopped.
def inspect(x):
    print(type(x))
    raise  # a bare raise with no active exception triggers a RuntimeError and halts the run
Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:
df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError
As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. Pandas runs the first group twice. It does this to determine if there is a fast way to complete the computation or not. This is a minor detail that you shouldn't worry about.
Now, let's do the same thing with transform:
df.groupby('State').transform(inspect)

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError
It is passed a Series, a totally different pandas object.
So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try to subtract column b from column a inside our custom function, we get an error with transform. See below:
def subtract_two(x):
    return x['a'] - x['b']

df.groupby('State').transform(subtract_two)

KeyError: ('a', 'occurred at index a')
We get a KeyError because pandas is attempting to find the Series index label a, which does not exist. You can complete this operation with apply, as it has the entire DataFrame:
df.groupby('State').apply(subtract_two)

State
Florida  2   -2
         3   -8
Texas    0   -2
         1   -5
dtype: int64
The output is a Series and a little confusing as the original index is kept, but we have access to all columns.
It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames get nicely rendered as HTML in a Jupyter notebook:
from IPython.display import display

def subtract_two(x):
    display(x)
    return x['a'] - x['b']
The other difference is that transform must return a one-dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not, an error is raised:
import numpy as np

def return_three(x):
    return np.array([1, 2, 3])

df.groupby('State').transform(return_three)

ValueError: transform must return a scalar value for each group
The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:
def rand_group_len(x):
    return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

          a         b
0  0.962070  0.151440
1  0.440956  0.782176
2  0.642218  0.483257
3  0.056047  0.238208
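A more practical function of this same-length kind is per-group centering, in the spirit of the zscore example quoted in the question. This is a minimal sketch on the sample data; the helper name demean is made up here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

def demean(x):
    # Subtract the group's mean; the result has as many rows as the group
    return x - x.mean()

# 'State' is the grouping column, so only 'a' and 'b' are transformed
demeaned = df.groupby('State').transform(demean)
print(demeaned)
```

The result has one row per original row and keeps the original index, which is why it can be assigned straight back as new columns of df.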
Returning a single scalar object also works for transform
If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:
def group_sum(x):
    return x.sum()

df.groupby('State').transform(group_sum)

   a   b
0  9  16
1  9  16
2  4  14
3  4  14
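Putting the two differences together suggests one way to get what the question was after, a per-group statistic of a difference between two columns repeated on every row. Do the multi-column arithmetic first, then group the resulting single Series. This is a sketch on the sample data rather than the original author's solution; the names diff, per_group and per_row are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

# The two-column step happens on the whole frame, before any grouping
diff = df['a'] - df['b']

# One value per group, indexed by State (aggregation)
per_group = diff.groupby(df['State']).mean()

# The same group value repeated on each row of that group (transform)
per_row = diff.groupby(df['State']).transform('mean')

print(per_group)
print(per_row)
```

Because diff is a single Series by the time it is grouped, transform never needs to see two columns at once, which sidesteps the KeyError shown earlier.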