Python Pandas 如何将 groupby 操作结果分配回父数据框中的列? [英] Python Pandas How to assign groupby operation results back to columns in parent dataframe?
问题描述
我在 IPython 中有以下数据框,其中每一行都是一只股票:
I have the following data frame in IPython, where each row is a single stock:
In [261]: bdata
Out[261]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21210 entries, 0 to 21209
Data columns:
BloombergTicker 21206 non-null values
Company 21210 non-null values
Country 21210 non-null values
MarketCap 21210 non-null values
PriceReturn 21210 non-null values
SEDOL 21210 non-null values
yearmonth 21210 non-null values
dtypes: float64(2), int64(1), object(4)
我想应用 groupby 操作,计算yearmonth"列中每个日期的所有内容的上限加权平均回报.
I want to apply a groupby operation that computes cap-weighted average return across everything, per each date in the "yearmonth" column.
这按预期工作:
In [262]: bdata.groupby("yearmonth").apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
Out[262]:
yearmonth
201204 -0.109444
201205 -0.290546
但是我想将这些值广播"回原始数据框中的索引,并将它们保存为日期匹配的常量列.
But then I want to sort of "broadcast" these values back to the indices in the original data frame, and save them as constant columns where the dates match.
In [263]: dateGrps = bdata.groupby("yearmonth")
In [264]: dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/mnt/bos-devrnd04/usr6/home/espears/ws/Research/Projects/python-util/src/util/<ipython-input-264-4a68c8782426> in <module>()
----> 1 dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
TypeError: 'DataFrameGroupBy' object does not support item assignment
我意识到这个幼稚的任务不应该奏效.但是,将 groupby 操作的结果分配到父数据帧上的新列的正确"Pandas 习惯用法是什么?
I realize this naive assignment should not work. But what is the "right" Pandas idiom for assigning the result of a groupby operation into a new column on the parent dataframe?
最后,我想要一个名为MarketReturn"的列,它是所有索引的重复常量值,这些索引的日期与 groupby 操作的输出相匹配.
In the end, I want a column called "MarketReturn" than will be a repeated constant value for all indices that have matching date with the output of the groupby operation.
实现这一目标的一个技巧如下:
One hack to achieve this would be the following:
marketRetsByDate = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
bdata["MarketReturn"] = np.repeat(np.NaN, len(bdata))
for elem in marketRetsByDate.index.values:
bdata["MarketReturn"][bdata["yearmonth"]==elem] = marketRetsByDate.ix[elem]
但这很慢,很糟糕,而且不符合 Python 风格.
But this is slow, bad, and unPythonic.
推荐答案
In [97]: df = pandas.DataFrame({'month': np.random.randint(0,11, 100), 'A': np.random.randn(100), 'B': np.random.randn(100)})
In [98]: df.join(df.groupby('month')['A'].sum(), on='month', rsuffix='_r')
Out[98]:
A B month A_r
0 -0.040710 0.182269 0 -0.331816
1 -0.004867 0.642243 1 2.448232
2 -0.162191 0.442338 4 2.045909
3 -0.979875 1.367018 5 -2.736399
4 -1.126198 0.338946 5 -2.736399
5 -0.992209 -1.343258 1 2.448232
6 -1.450310 0.021290 0 -0.331816
7 -0.675345 -1.359915 9 2.722156
这篇关于Python Pandas 如何将 groupby 操作结果分配回父数据框中的列?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!