pandas 定制的群体聚集 [英] Pandas customized group aggregation
问题描述
我有一个关于熊猫和自定义组聚合的问题,以找到最有效的方式来计算我的价值.这是我的代码段:
I have a question regarding pandas and customised group aggregations to find the most efficient way to calculate my values. Here is my code snippet:
import pandas as pd
listA = list('abcdefghijklmnopqrstuvwxyz') * 2
listB = listA[::-1]
listC = listA[::2] * 2
listD = "Won"
data1 = range(52)
data2 = range(52,104)
data3 = range(104,156)
rawStructure = [('A', listA),
('B', listB),
('C', listC),
('D', listD),
('Data1', data1),
('Data2', data2),
('Data3', data3)]
df = pd.DataFrame.from_items(rawStructure, orient='columns')
df.loc[40:,"D"] = "Lost"
def customfct(x,y,z):
print('x',x)
data = round(((x.sum() + y.sum())/z.sum()) * 100,2)
return data
def f(row):
val1 = row.loc[(row['D'] == "Won"), 'Data1'].sum()
val2 = row.loc[(row['D'] == "Won"), 'Data2'].sum()
val3 = row.loc[(row['D'] == "Won"), 'Data3'].sum()
val4 = customfct(row.loc[(row['D'] == "Won"), 'Data1'], row.loc[(row['D'] == "Won"), 'Data2'], row.loc[(row['D'] == "Won"), 'Data3'])
return val1, val2, val3, val4
groupByCriteria = "C"
agg = df[:].groupby(by=groupByCriteria).apply(f)
print(agg)
我想知道是否有一种更有效的方法来进行分组和应用自定义计算(例如函数"customfct",它使用不同的列(Data1,Data2,Data3)).我的第一种方法是在这里看到类似的内容:http://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/,但是创建一个不限制一列的公式似乎是不可行的(例如,λx:max(x)-min(x)).此外,您将如何返回熊猫数据框而不是熊猫系列(带有元组)?预先感谢!
I would like to know if there is a more efficient way to make groupings and apply customised calculations (like function "customfct", which uses different columns (Data1, Data2,Data3)). My first approach was something like you could see here: http://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/ but it seems to be infeasible to create a formula which isn't constraint on one column (e.g lambda x: max(x) - min(x)). Furthermore, how would you return a pandas data frame instead of a pandas series (with a tuple)? Thanks in advance!
这是我当前的输出(是正确的,但是我想有一种更有效的方法):
That is my current output (which is correct, but I guess there is a more efficient way):
推荐答案
请考虑在一次 groupby()
调用中汇总所有 Data 列,然后为 val4 .然后将聚合合并回原始数据帧.
Consider aggregating all Data columns in one groupby()
call and then create a new column for val4. Then merge aggregation back to original dataframe.
# EQUIVALENT EXAMPLE DATA
listA = list('abcdefghijklmnopqrstuvwxyz') * 2
df = pd.DataFrame({'A': listA, 'B': listA[::-1], 'C': listA[::2] * 2,
'D': ["Won" for i in range(40)] + ["Lost" for i in range(40,52)],
'Data1': range(52), 'Data2': range(52,104), 'Data3': range(104,156)})
# ADJUSTED METHOD
groupByCriteria = "C"
grp = df[df['D']=="Won"].groupby(by=groupByCriteria).sum().reset_index()\
.rename(columns={'Data1':'val1','Data2':'val2','Data3':'val3'})
grp['val4'] = round(((grp['val1'] + grp['val2'])/grp['val3']) * 100,2)
agg = df.merge(grp, on='C').sort_values('Data1').reset_index(drop=True)
在时序比较中,调整后的代码明显更快.请注意:您的方法已调整为返回一个数据框而不是一个序列.
In timing comparisons, adjusted code is markedly faster. Do note: your method was adjusted to return a dataframe and not a series.
def origfct():
def customfct(x,y,z):
#print('x',x)
data = round(((x.sum() + y.sum())/z.sum()) * 100,2)
return data
def f(row):
row['val1'] = row.loc[(row['D'] == "Won"), 'Data1'].sum()
row['val2'] = row.loc[(row['D'] == "Won"), 'Data2'].sum()
row['val3'] = row.loc[(row['D'] == "Won"), 'Data3'].sum()
row['val4'] = customfct(row.loc[(row['D'] == "Won"), 'Data1'],
row.loc[(row['D'] == "Won"), 'Data2'],
row.loc[(row['D'] == "Won"), 'Data3'])
return row
groupByCriteria = "C"
agg = df[:].groupby(by=groupByCriteria).apply(f)
return agg
def newsetup():
groupByCriteria = "C"
grp = df[df['D']=="Won"].groupby(by=groupByCriteria).sum().reset_index()\
.rename(columns={'Data1':'val1','Data2':'val2','Data3':'val3'})
grp['val4'] = round(((grp['val1'] + grp['val2'])/grp['val3']) * 100,2)
agg = df.merge(grp, on='C').sort_values('Data1').reset_index(drop=True)
return agg
python -mtimeit -n'100' -s'import pyscript as test' 'test.origfct()'
# 100 loops, best of 3: 198 msec per loop
python -mtimeit -n'100' -s'import pyscript as test' 'test.newsetup()'
# 100 loops, best of 3: 16 msec per loop
这篇关于 pandas 定制的群体聚集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!