Multiple summary statistics on grouped column in Julia
Problem description
I am trying to get the code below to work with Julia (1.5.3). It is just a representation of what I am trying to do.
using DataFrames
using DataFramesMeta
using RDatasets
using Statistics  # provides mean and median used below
## setup
iris = dataset("datasets", "iris")
gdf = groupby(iris, :Species)
## Applying the split combine
## This code works fine
combine(gdf, nrow, (valuecols(gdf) .=> mean))
But when I try to compute multiple summaries at once, it fails:
combine(gdf, nrow, (valuecols(gdf) .=> [mean, sum]))
Error:
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 4 and 2")
A little debugging of the error suggests that it goes away if I change my code to this:
combine(gdf, nrow, ([:SepalLength, :PetalLength] .=> [mean,sum]))
## This code works, but it is still not correct: it does not give the mean and sum of both columns, only the mean of SepalLength and the sum of PetalLength, which is consistent with the previous error
After a little more research I realized that we can do something like the following. The result is correct, but it comes out as a long table rather than a wide one. I expected this to answer my question, but it does not work the way I hoped.
combine(gdf, ([:SepalWidth, :PetalWidth] .=> x -> ([sum(x), mean(x)])))
## The code above works, but the output is a 6x3 DataFrame; I was expecting a 3x6 DataFrame
My question is:
Is there any way to use split-combine so that I get a wide table like the one below (I used a do ... end block with combine to generate it)? I am okay with this solution, but I have to type out every column here. Is there a way to get all the summary statistics (sum, median, mean, etc.) as columns for every column passed to combine? I hope my question is clear; please point out if it is a duplicate or not well communicated. Thanks.
combine(gdf) do x
    return (sw_sum = sum(x.SepalWidth),
            sw_mean = mean(x.SepalWidth),
            sp_mean = mean(x.PetalWidth),
            sp_sum = sum(x.PetalWidth))
end
## My expected answer should be similar to this
#3×5 DataFrame
# Row │ Species     sw_sum   sw_mean  sp_mean  sp_sum
#     │ Cat…        Float64  Float64  Float64  Float64
#─────┼────────────────────────────────────────────────
#   1 │ setosa        171.4    3.428    0.246     12.3
#   2 │ versicolor    138.5    2.77     1.326     66.3
#   3 │ virginica     148.7    2.974    2.026    101.3
Also, this works:
combine(gdf, [:1] .=> [mean, sum, minimum, maximum,median])
But this does not, and throws the same dimension error as above. I am still scratching my head over it:
combine(gdf, [:1, :2] .=> [mean, sum, minimum, maximum,median])
Recommended answer
Do:
combine(gdf, nrow, vec(valuecols(gdf) .=> [mean sum]))
or
combine(gdf, nrow, (valuecols(gdf) .=> [mean sum])...)
or
combine(gdf, nrow, [n => f for n in valuecols(gdf) for f in [mean sum]])
(note that there is no comma between mean and sum)
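The reason is that you need to add an additional dimension to the broadcasted .=> in order to get all combinations of inputs: [mean sum] is a 1×2 row matrix, whereas [mean, sum] is a 2-element vector that cannot be broadcast against the 4-element vector of column names.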
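The broadcasting behaviour described above can be checked without DataFrames at all. The following is a minimal sketch using only Base functions; the column names in cols are made up for illustration, and sum and maximum stand in for whichever summary functions you need.

```julia
# Hypothetical column names; any 4-element vector behaves the same way
cols = [:a, :b, :c, :d]

# Vector .=> Vector of a different length: lengths 4 and 2 cannot be broadcast
err = try
    cols .=> [sum, maximum]
    nothing
catch e
    e
end
println(err isa DimensionMismatch)

# Vector .=> 1×2 row matrix: broadcasting expands to every combination
combos = cols .=> [sum maximum]
println(size(combos))

# vec flattens the matrix column by column into a plain vector of pairs
flat = vec(combos)
println(length(flat))
```

The flattened vector lists all the sum pairs first and then all the maximum pairs, because Julia stores matrices column by column.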
The ... (splat) in the second variant just iterates a collection and passes its elements as consecutive positional arguments to the function, e.g.:
julia> f(x...) = x
f (generic function with 1 method)
julia> f(1, [2,3,4]...)
(1, 2, 3, 4)
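To see the vec and splat variants side by side, here is a small sketch on a hand-made table instead of iris. The table and its column names (g, a, b) are assumptions made up for this example; the combine calls are the ones from the answer.

```julia
using DataFrames
using Statistics

# Hypothetical toy table standing in for iris
df = DataFrame(g = ["x", "x", "y", "y"],
               a = [1.0, 2.0, 3.0, 4.0],
               b = [10.0, 20.0, 30.0, 40.0])
gdf = groupby(df, :g)

# The row matrix [mean sum] is broadcast against the vector of value columns,
# so every (column, function) combination is produced
wide_vec   = combine(gdf, nrow, vec(valuecols(gdf) .=> [mean sum]))
wide_splat = combine(gdf, nrow, (valuecols(gdf) .=> [mean sum])...)

println(names(wide_vec))
println(wide_vec == wide_splat)
```

Both calls should give one row per group, with one column per combination named after the source column and the function (for example a_mean), which is the wide layout the question asks for.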