在Python/Pandas中创建部分SAS PROC Summary替换 [英] Creating a partial SAS PROC SUMMARY replacement in Python/Pandas
问题描述
我们正在努力摆脱SAS并转而使用Python/Pandas.但是,我们遇到的一件事是创建具有SAS例程灵活性的PROC SUMMARY
(AKA PROC MEANS
)替代品.对于非SAS用户:PROC SUMMARY
只是一个例程,用于生成表,该表包含数据集中所有观察值或一组观察值内变量的描述性统计信息",以解释SAS文档.我们的要求只是全部功能的一小部分-输出我们有的表:
We are working to get off of SAS and onto Python/Pandas. However, one thing we are having trouble with is creating a replacement for PROC SUMMARY
(AKA PROC MEANS
) that has the SAS routine's flexibility. For non-SAS users: PROC SUMMARY
is just a routine that produces a table containing "descriptive statistics for variables across all observations or within groups of observations" in a dataset, to paraphrase the SAS documentation. Our requirements are just a small subset of the full functionality - outputting a table where we have:
- 能够将不同的统计信息应用于不同的列(目前仅是计数,总和,均值,加权均值)
- 能够处理零到多个分组变量
- 能够为加权平均值指定权重变量
我们没有尝试做任何其他事情(任何图形化的操作,等等)
We are not trying to do anything else (anything graphical, etc.)
这是我们到目前为止所拥有的:
Here is what we have so far:
def wmean_ungrouped (d,w):
return (d.dot(w)).sum() / w.sum()
def wmean_grouped (group, var_name_in, var_name_weight):
d = group[var_name_in]
w = group[var_name_weight]
return (d * w).sum() / w.sum()
FUNCS = {
"mean" : np.mean ,
"sum" : np.sum ,
"count" : np.count_nonzero
}
def my_summary (
data ,
var_names_in ,
var_names_out ,
var_functions ,
var_name_weight = None ,
var_names_group = None
):
result = DataFrame()
if var_names_group is not None:
grouped = data.groupby (var_names_group)
for var_name_in, var_name_out, var_function in \
zip(var_names_in,var_names_out,var_functions):
if var_function == "wmean":
func = lambda x : wmean_grouped (x, var_name_in, var_name_weight)
result[var_name_out] = Series(grouped.apply(func))
else:
func = FUNCS[var_function]
result[var_name_out] = grouped[var_name_in].apply(func)
else:
for var_name_in, var_name_out, var_function in \
zip(var_names_in,var_names_out,var_functions):
if var_function == "wmean":
result[var_name_out] = \
Series(wmean_ungrouped(data[var_name_in], data[var_name_weight]))
else:
func = FUNCS[var_function]
result[var_name_out] = Series(func(data[var_name_in]))
return result
这是对my_summary()
函数的示例调用:
Here is a sample call to the my_summary()
function:
my_summary (
data=df,
var_names_in=["x_1","x_1","x_1","x_1"] ,
var_names_out=[
"x_1_c","x_1_s","x_1_m","x_1_wm"
] ,
var_functions=["count","sum","mean","wmean"] ,
var_name_weight="val_1" ,
var_names_group=["Region","Category"]
)
my_summary()
可以工作,但是如您所见,它的实现不是最漂亮的.主要问题如下:
my_summary()
works, but as you can see, its implementation is not the prettiest. Here are the main issues:
- 取决于已分组还是未分组的两个不同的代码路径-这完全源于以下事实:
DataFrame
和DataFrameGroupBy
具有将以编程方式选择的归约函数应用于单个列的不同方法.对于DataFrame
,我发现的唯一方法是直接调用func(data[var_name_in])
.data[var_name_in].apply(func)
不起作用,因为Series
上的apply()
不会减小(与DataFrame
上的apply()
不同).另一方面,对于DataFrameGroupBy
,我必须使用这种方法:grouped[var_name_in].apply(func)
.那是因为func(grouped[var_name_in])
之类的东西不起作用(没有理由应该如此). - 加权平均值的特殊处理-这是因为它在两列上运行,而与所有其他计算只在其中一列上进行计算不同;我不知道这是否可以帮助您.
- 两个不同的加权均值函数-这是第一个问题的结果.未分组的函数具有
Series
类型的参数,并且需要dot()
对其进行乘减.分组函数最终会处理SeriesGroupBy
对象,并且必须使用*
运算符(对
- Two different code paths depending on grouped or ungrouped - this stems completely from the fact that
DataFrame
andDataFrameGroupBy
have different ways for applying a programmatically-selected reducing function to a single column. ForDataFrame
, the only way I've found is directly invokingfunc(data[var_name_in])
.data[var_name_in].apply(func)
doesn't work becauseapply()
on aSeries
does not reduce (unlikeapply()
on aDataFrame
). On the other hand, forDataFrameGroupBy
, I have to use that very approach:grouped[var_name_in].apply(func)
. That's because something likefunc(grouped[var_name_in])
will not work (no reason it should.) - Special treatment for weighted mean - this is because it operates on two columns, unlike all the other calculations, which operate on just one; I don't know if this can be helped.
- Two different weighted mean functions - this is a consequence of the first issue. The ungrouped function has
Series
-type parameters and needsdot()
to multiply and reduce them; the grouped function eventually deals withSeriesGroupBy
objects and has to use the*
operator (acknowledgements to the answer to this SO post for the weighted average function code.)
所以我的问题是:
- 大熊猫是否有某种东西可以完成所有这些工作(即扔掉上面的东西并改用它)?
- 如果没有,是否对上述任何问题进行了修复?
- 有什么可能的话,什么都可以不进行分组-即从
DataFrame
获取DataFrameGroupBy
对象而不对任何变量进行分组?然后,将减少代码路径,因为我们将专门处理DataFrameGroupBy
接口.
- Is there something native to pandas that can do all of this (i.e. throw out the above and use that instead)?
- If not, are there any fixes to any of the issues mentioned above?
- By any chance, is there some way to group by nothing - that is, to obtain a
DataFrameGroupBy
object from aDataFrame
without grouping on any variable? Then the code paths would be reduced as we would be dealing with theDataFrameGroupBy
interface exclusively.
@JohnE的答案提供了一种按任何内容进行分组的方法:groupby(lambda x: True)
.这是他发现在本SO帖子中(顺便提一下,Wes本人回答说,有必要使用DataFrame.agg()
来达到相同的目的). @JohnE的出色解决方案使我们能够专门处理类型为DataFrameGroupBy
的对象,并立即减少了大多数代码路径.由于我们只有DataFrameGroupBy
实例,因此我可以使用一些可行的功能来进一步减少这种情况.基本上,所有函数都是根据需要生成的-生成器"(在此处加引号,以免与Python生成器表达式混淆)具有两个参数:值列名称和权重列名称,在所有情况下都将忽略其中的第二个参数wmean
.生成的函数始终应用于整个DataFrameGroupBy
,就像最初使用wmean
一样,其参数是要使用的正确列名.我还用熊猫计算替换了所有np.*
实现,以便更好地处理NaN
值.
@JohnE's answer provides a way to group by nothing: groupby(lambda x: True)
. This is a workaround that he spotted in this SO post (which, incidentally, features an answer from Wes himself speaking of the need for a DataFrame.agg()
, which would serve the same purpose). @JohnE's excellent solution allows us to deal exclusively with objects of type DataFrameGroupBy
, and instantly reduces most of the code paths. I was able to reduce further using some functional gimmickry that is now possible because we have only DataFrameGroupBy
instances. Basically, all functions are generated as needed - the "generators" (in quotes here so as to not be confused with Python generator expressions) take two parameters: value column name and weight column name, the second of which is ignored in all cases except wmean
. The generated functions are always applied over the entire DataFrameGroupBy
, as was originally the case just with wmean
, with the parameters being the correct column name(s) to use. I also replaced all the np.*
implementations with pandas calculations, to better deal with NaN
values.
除非大熊猫有本领可以做到这一点,否则这是我们的解决方案:
Unless there is something native to pandas that can do this, this is our solution:
FUNC_GENS = {
"mean" : lambda y,z : lambda x : x[y].mean(),
"sum" : lambda y,z : lambda x : x[y].sum() ,
"count" : lambda y,z : lambda x : x[y].count() ,
"wmean" : lambda y,z : lambda x : (x[y] * x[z]).sum() / x[z].sum()
}
def my_summary (
data ,
var_names_in ,
var_names_out ,
var_functions ,
var_name_weight = None ,
var_names_group = None ):
result = pd.DataFrame()
if var_names_group is None:
grouped = data.groupby (lambda x: True)
else:
grouped = data.groupby (var_names_group)
for var_name_in, var_name_out, var_function in \
zip(var_names_in,var_names_out,var_functions):
func_gen = FUNC_GENS[var_function]
func = func_gen (var_name_in, var_name_weight)
result[var_name_out] = grouped.apply(func)
return result
更新2019/当前解决方案
在我的原始帖子之后发布的熊猫版本现在实现了大多数此功能:
Update 2019 / Current Solution
Versions of pandas released subsequent to my original post now implement most of this functionality:
- 一无所获-Wes M.过去曾说过需要它确实是在0.20.0版中添加的以及从0.25.+ 版本开始现在是熊猫的一部分,格式为
- Grouping by nothing - Wes M. had spoken in the past of the need for a
DataFrame.agg()
and it was indeed added in version 0.20.0 along with aSeries.agg()
. - Multiple aggregations of multiple columns, with specified names for output columns - this is now part of pandas as of version 0.25.+ in the form of
NamedAgg
inputs to theagg()
function
因此,除加权平均值外,基本上所有其他内容.当前的一个很好的解决方案是此处.
So, basically everything except the weighted average. A good current solution for that is here.
推荐答案
好吧,这是一个可以解决两个问题的快捷方式(但是加权均值仍然需要使用其他函数).大多数情况下,它会使用此处的技巧(贷记为@DSM)以执行groupby(lamda x: True)
来避开您的空群组.如果在手段之类的东西上有一个权重"的扭曲,但据我所知没有,那将是很棒的.显然,有一个用于加权分位数的软件包,在此处基于numpy提及,但我不知道对此一无所知.伟大的项目顺便说一句!
Well, here's a quickie that does get at two issues (but still requires a different function for weighted mean). Mostly it uses the trick here (credit to @DSM) to get around your empty group by doing groupby(lamda x: True)
. It would be great if there was a kwarg for 'weights' on stuff like means but there is not, to my knowledge. Apparently there is a package for weighted quantiles mentioned here based on numpy but I don't know anything about it. Great project btw!
(请注意,名称与您的名称基本相同,我只是在wmean_grouped和my_summary中添加了"2",否则您可以使用相同的调用接口)
(note that names are mostly the same as yours, I just added a '2' to wmean_grouped and my_summary, otherwise you can use the same calling interface)
def wmean_grouped2 (group, var_name_in, var_name_weight):
d = group[var_name_in]
w = group[var_name_weight]
return (d * w).sum() / w.sum()
FUNCS = { "mean" : np.mean ,
"sum" : np.sum ,
"count" : np.count_nonzero }
def my_summary2 (
data ,
var_names_in ,
var_names_out ,
var_functions ,
var_name_weight = None ,
var_names_group = None ):
result = pd.DataFrame()
if var_names_group is None:
grouped = data.groupby (lambda x: True)
else:
grouped = data.groupby (var_names_group)
for var_name_in, var_name_out, var_function in \
zip(var_names_in,var_names_out,var_functions):
if var_function == "wmean":
func = lambda x : wmean_grouped2 (x, var_name_in, var_name_weight)
result[var_name_out] = pd.Series(grouped.apply(func))
else:
func = FUNCS[var_function]
result[var_name_out] = grouped[var_name_in].apply(func)
return result
这篇关于在Python/Pandas中创建部分SAS PROC Summary替换的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!