Python Pandas: how to add a totally new column to a data frame inside of a groupby/transform operation


Question

I want to mark some quantiles in my data, and for each row of the DataFrame, I would like the entry in a new column called e.g. "xtile" to hold this value.

For example, suppose I create a data frame like this:

import pandas, numpy as np
dfrm = pandas.DataFrame({'A':np.random.rand(100), 
                         'B':(50+np.random.randn(100)), 
                         'C':np.random.randint(low=0, high=3, size=(100,))})

Say I write my own function to compute the quintile of each element in an array. I have my own function for this, but as an example just refer to scipy.stats.mstats.mquantiles.

import scipy.stats as st
def mark_quintiles(x, breakpoints):
    # Assume this is filled in, using st.mstats.mquantiles.
    # It returns an array the same shape as x, holding for each entry of x
    # the integer index of the breakpoint bucket that entry falls into.
    raise NotImplementedError  # placeholder so the stub parses
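The question leaves the body of mark_quintiles unspecified, so the following is only a minimal sketch of what such a function might look like. It swaps in np.quantile for st.mstats.mquantiles (an assumption, chosen to keep the example dependent on NumPy only), and the bucket numbering (0 for the lowest bucket) is likewise an assumption rather than the author's actual convention.

```python
import numpy as np

def mark_quintiles(x, breakpoints):
    """Return, for each entry of x, the integer index of its quantile bucket.

    Hypothetical stand-in for the author's function: np.quantile is used
    here in place of scipy.stats.mstats.mquantiles.
    """
    x = np.asarray(x, dtype=float)
    # Empirical quantile values at the requested probabilities.
    edges = np.quantile(x, breakpoints)
    # Count how many edges lie strictly below each value; that count is
    # the bucket index (0 = lowest bucket).
    buckets = np.searchsorted(edges, x, side="left")
    # If the last breakpoint is below 1.0, values above the last edge
    # are folded into the last bucket.
    return np.minimum(buckets, len(edges) - 1)

values = np.arange(100.0)
labels = mark_quintiles(values, [0.2, 0.4, 0.6, 0.8, 1.0])
```

The result has the same shape as the input, which is the property the rest of the question relies on.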

Now, the real question is how to use transform to add a new column to the data. Something like this:

def transformXtiles(dataFrame, inputColumnName, newColumnName, breaks):
    dataFrame[newColumnName] = mark_quintiles(dataFrame[inputColumnName].values, 
                                              breaks)
    return dataFrame

And then:

dfrm.groupby("C").transform(lambda x: transformXtiles(x, "A", "A_xtile", [0.2, 0.4, 0.6, 0.8, 1.0]))

The problem is that the above code does not add the new column "A_xtile". It just returns my data frame unchanged. If I first add a column called "A_xtile" full of dummy values, such as NaN, then it does successfully overwrite that column with the correct quintile markings.

But it is extremely inconvenient to have to first write in the column for anything like this that I may want to add on the fly.

Note that a simple apply will not work here, since it won't know how to make sense of the possibly differently-sized result arrays for each group.

Solution

What problems are you running into with apply? It works for this toy example here and the group lengths are different:

In [82]: df
Out[82]: 
   X         Y
0  0 -0.631214
1  0  0.783142
2  0  0.526045
3  1 -1.750058
4  1  1.163868
5  1  1.625538
6  1  0.076105
7  2  0.183492
8  2  0.541400
9  2 -0.672809

In [83]: def func(x):
   ....:     x['NewCol'] = np.nan
   ....:     return x
   ....: 

In [84]: df.groupby('X').apply(func)
Out[84]: 
   X         Y  NewCol
0  0 -0.631214     NaN
1  0  0.783142     NaN
2  0  0.526045     NaN
3  1 -1.750058     NaN
4  1  1.163868     NaN
5  1  1.625538     NaN
6  1  0.076105     NaN
7  2  0.183492     NaN
8  2  0.541400     NaN
9  2 -0.672809     NaN
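For the original quintile problem, one alternative to calling apply on whole groups is to select the input column first and assign the result of transform to a new column. This is a sketch of that alternative, not the method shown in the answer above. The mark_quintiles body is a hypothetical stand-in built on np.quantile, and the seeded random generator is only there to make the example reproducible.

```python
import numpy as np
import pandas as pd

def mark_quintiles(x, breakpoints):
    # Hypothetical stand-in for the author's function (np.quantile based).
    x = np.asarray(x, dtype=float)
    edges = np.quantile(x, breakpoints)
    buckets = np.searchsorted(edges, x, side="left")
    return np.minimum(buckets, len(edges) - 1)

rng = np.random.default_rng(0)
dfrm = pd.DataFrame({
    "A": rng.random(100),
    "B": 50 + rng.standard_normal(100),
    "C": rng.integers(low=0, high=3, size=100),
})

breaks = [0.2, 0.4, 0.6, 0.8, 1.0]

# transform on a single column hands each group's values to the function as
# a Series and expects a result of the same length; the per-group results
# are aligned back to the original row order.
dfrm["A_xtile"] = dfrm.groupby("C")["A"].transform(
    lambda s: pd.Series(mark_quintiles(s.to_numpy(), breaks), index=s.index)
)
```

Because the new column is created by plain assignment on the outer frame, there is no need to pre-fill it with dummy values, and groups of different sizes are handled without extra work.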
