Python Pandas: how to add a totally new column to a data frame inside of a groupby/transform operation

Views: 657 | Published: 2017/3/26 1:04:42 | Tags: python, group-by, transform, dataframe, pandas


Question

I want to mark some quantiles in my data, and for each row of the DataFrame, I would like the entry in a new column called e.g. "xtile" to hold this value.

For example, suppose I create a data frame like this:

import pandas, numpy as np
dfrm = pandas.DataFrame({'A':np.random.rand(100), 
                         'B':(50+np.random.randn(100)), 
                         'C':np.random.randint(low=0, high=3, size=(100,))})

And let's say I write my own function to compute the quintile of each element in an array. I have my own function for this, but as an example just refer to scipy.stats.mstats.mquantiles.

import scipy.stats as st
def mark_quintiles(x, breakpoints):
    # Assume this is filled in, using st.mstats.mquantiles.
    # This returns an array the same shape as x, with an integer for which
    # breakpoint-bucket that entry of x falls into.
    raise NotImplementedError  # placeholder: the body is not shown in the question
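
For illustration only, the placeholder above might be filled in along the following lines. This is a sketch under stated assumptions, not the asker's actual function: it uses np.quantile in place of st.mstats.mquantiles (the two use different default interpolation rules, so the cut points can differ slightly), and it numbers the buckets from 0.

```python
import numpy as np

def mark_quintiles(x, breakpoints):
    """Return, for each entry of x, the 0-based index of its breakpoint bucket.

    Hypothetical stand-in for the function described in the question.
    """
    x = np.asarray(x, dtype=float)
    # Value of x at each breakpoint probability (20th, 40th, ... percentile).
    cutoffs = np.quantile(x, breakpoints)
    # Bucket i holds the values in (cutoffs[i-1], cutoffs[i]].
    return np.searchsorted(cutoffs, x, side="left")

values = np.arange(1, 11)  # 1, 2, ..., 10
buckets = mark_quintiles(values, [0.2, 0.4, 0.6, 0.8, 1.0])
```

Because the last breakpoint is 1.0, the largest value lands in the last bucket rather than past it.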

Now, the real question is how to use transform to add a new column to the data. Something like this:

def transformXtiles(dataFrame, inputColumnName, newColumnName, breaks):
    dataFrame[newColumnName] = mark_quintiles(dataFrame[inputColumnName].values, 
                                              breaks)
    return dataFrame

And then:

dfrm.groupby("C").transform(lambda x: transformXtiles(x, "A", "A_xtile", [0.2, 0.4, 0.6, 0.8, 1.0]))

The problem is that the above code will not add the new column "A_xtile". It just returns my data frame unchanged. If I first add a column full of dummy values, like NaN, called "A_xtile", then it does successfully overwrite this column to include the correct quintile markings.

But it is extremely inconvenient to have to first write in the column for anything like this that I may want to add on the fly.

Note that a simple apply will not work here, since it won't know how to make sense of the possibly differently-sized result arrays for each group.

Answer

What problems are you running into with apply? It works for this toy example here and the group lengths are different:

In [82]: df
Out[82]: 
   X         Y
0  0 -0.631214
1  0  0.783142
2  0  0.526045
3  1 -1.750058
4  1  1.163868
5  1  1.625538
6  1  0.076105
7  2  0.183492
8  2  0.541400
9  2 -0.672809

In [83]: def func(x):
   ....:     x['NewCol'] = np.nan
   ....:     return x
   ....: 

In [84]: df.groupby('X').apply(func)
Out[84]: 
   X         Y  NewCol
0  0 -0.631214     NaN
1  0  0.783142     NaN
2  0  0.526045     NaN
3  1 -1.750058     NaN
4  1  1.163868     NaN
5  1  1.625538     NaN
6  1  0.076105     NaN
7  2  0.183492     NaN
8  2  0.541400     NaN
9  2 -0.672809     NaN
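
For the data frame in the question, one way to get the per-group column without creating it beforehand is to select the single input column, call transform on it, and assign the result back. This is only a sketch and not the code from the answer: it uses transform on one column rather than apply, and mark_quintiles here is a hypothetical stand-in built on np.quantile, since the asker's own function is not shown.

```python
import numpy as np
import pandas as pd

def mark_quintiles(x, breakpoints):
    # Hypothetical stand-in for the asker's function:
    # 0-based bucket index for each entry of x.
    cutoffs = np.quantile(x, breakpoints)
    return np.searchsorted(cutoffs, x, side="left")

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
dfrm = pd.DataFrame({
    "A": rng.random(100),
    "B": 50 + rng.standard_normal(100),
    "C": rng.integers(low=0, high=3, size=100),
})

breaks = [0.2, 0.4, 0.6, 0.8, 1.0]

# transform calls the function once per group of "C" and returns a Series
# aligned to dfrm's index, so it can be assigned directly as a new column.
dfrm["A_xtile"] = dfrm.groupby("C")["A"].transform(
    lambda s: mark_quintiles(s.to_numpy(), breaks)
)
```

Since the function returns one value per row of each group, the groups can have different sizes and the new column still lines up with the original rows.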
