Pandas transform() vs apply()
Question
I don't understand why apply and transform return different dtypes when called on the same data frame. The way I explained the two functions to myself before went something along the lines of "apply collapses the data, and transform does exactly the same thing as apply but preserves the original index and doesn't collapse." Consider the following.
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [1,1,0,0,1,0,0,0,0,1]})
Let's identify those ids which have a nonzero entry in the cat column.
>>> df.groupby('id')['cat'].apply(lambda x: (x == 1).any())
id
1 True
2 True
3 False
4 True
Name: cat, dtype: bool
Great. If we wanted to create an indicator column, however, we could do the following.
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 1
Name: cat, dtype: int64
I don't understand why the dtype is now int64 instead of the boolean returned by the any() function.
When I change the original data frame to contain some booleans (note that the zeros remain), the transform approach returns booleans in an object column. This is an extra mystery to me since all of the values are boolean, but it's listed as object, apparently to match the dtype of the original mixed-type column of integers and booleans.
df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [True,True,0,0,True,0,0,0,0,True]})
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
Name: cat, dtype: object
However, when I use all booleans, the transform function returns a boolean column.
df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [True,True,False,False,True,False,False,False,False,True]})
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
Name: cat, dtype: bool
Using my acute pattern-recognition skills, it appears that the dtype of the resulting column mirrors that of the original column. I would appreciate any hints about why this occurs or what's going on under the hood in the transform function. Cheers.
Recommended answer
It looks like SeriesGroupBy.transform() tries to cast the result back to the dtype of the original column, whereas DataFrameGroupBy.transform() doesn't seem to do that:
In [139]: df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
Out[139]:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 1
Name: cat, dtype: int64
# note the double brackets around 'cat' in the next call: [['cat']] selects a DataFrame, not a Series
In [140]: df.groupby('id')[['cat']].transform(lambda x: (x == 1).any())
Out[140]:
cat
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
In [141]: df.dtypes
Out[141]:
cat int64
id int64
dtype: object
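
If what you want is a boolean indicator column no matter how your pandas version handles that cast, here are two ways that should work. This is only a sketch and not part of the behaviour shown above: the names indicator_cast, indicator_mask and has_cat are made up for illustration. The first option casts the transform result explicitly. The second builds the boolean mask before grouping, so the column being transformed is already bool.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
                   'cat': [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]})

# Option 1: keep the original lambda and cast the result explicitly,
# so the dtype no longer depends on what transform() decides to return.
indicator_cast = (
    df.groupby('id')['cat']
      .transform(lambda x: (x == 1).any())
      .astype(bool)
)

# Option 2: compute the boolean mask first, then group that mask by id.
# The grouped column is bool to begin with, so the result is bool as well.
mask = df['cat'].eq(1)
indicator_mask = mask.groupby(df['id']).transform('any')

# 'has_cat' is a hypothetical column name used only for this example.
df['has_cat'] = indicator_mask
print(df)
```

Both results keep the original row index, so either one can be assigned straight back to the data frame.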