Use of ddply + mutate with a custom function?
Question
I use ddply quite frequently, but historically with summarize (occasionally mutate) and only basic functions like mean(), var1 - var2, etc. I have a dataset to which I'm trying to apply a custom, more involved function, and I started digging into how to do this with ddply. I've got a working solution, but I don't understand why it has to work like this compared with more "normal" functions.
Related:
- Custom Function not recognized by ddply {plyr}...
- How do I pass variables to a custom function in ddply?
- r-help: [R] Correct use of ddply with own function (I ended up basing my solution on this)
Here's an example dataset:
library(plyr)
df <- data.frame(id = rep(letters[1:3], each = 3),
value = 1:9)
Normally, I'd use ddply like so:
df_ply_1 <- ddply(df, .(id), mutate, mean = mean(value))
My visualization of this is that ddply splits df into "mini" data frames based on grouped combos of id, and then I add a new column by calling mean() on a column name that exists in df. So, my attempt to implement a function extended this idea:
# actually, my logical extension of the above was to use:
# ddply(..., mean = function(value) { mean(value) })
df_ply_2 <- ddply(df, .(id), mutate,
mean = function(df) { mean(df$value) })
Error: attempt to replicate an object of type 'closure'
None of the help on custom functions uses mutate, but that seems inconsistent, or at least annoying to me, as the analog to my implemented solution is:
df_mean <- function(df) {
temp <- data.frame(mean = rep(mean(df$value), nrow(df)))
temp
}
df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean
In-line, it looks like I have to do this:
df_ply_4 <- df
df_ply_4$mean <- ddply(df, .(id), function(x) {
temp <- data.frame(mean = rep(mean(x$value), length(x$value)))
temp})$mean
Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?
Thanks for helping me "get it"!
Update after @Gregor's answer
Awesome answer, and I think I now get it. I was, indeed, confused about what mutate and summarize meant: I thought they were arguments to ddply about how to handle the result, when they are actually the functions themselves. So, thanks for that big insight.
Also, it really helped to understand that without mutate/summarize, I need to return a data.frame, which is the reason I have to cbind a named column onto the df that gets returned.
Lastly, if I do use mutate, it's helpful to now realize I can return a vector and get the right result. Thus, I can do this, which I now understand after reading your answer:
# I also caught that the code above doesn't do the right thing
# and recycles the single value returned by mean() vs. repeating it like
# I expected. Now that I know it's taking a vector, I know I need to return
# a vector the same length as my mini df
custom_mean <- function(x) {
rep(mean(x), length(x))
}
df_ply_5 <- ddply(df, .(id), mutate,
mean = custom_mean(value))
Thanks again for your in-depth answer!
Update per the latest comment from @Gregor
Hmmm. I used rep(mean(x), length(x)) because of this observation about df_ply_3's result (I admit to not actually looking at it closely when I ran it the first time while writing this post; I just saw that it didn't give me an error!):
df_mean <- function(x) {
data.frame(mean = mean(x$value))
}
df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean
df_ply_3
id value mean
1 a 1 2
2 a 2 5
3 a 3 8
4 b 4 2
5 b 5 5
6 b 6 8
7 c 7 2
8 c 8 5
9 c 9 8
So, I'm thinking that my code only worked by accident, because I had 3 id values each repeated 3 times. The actual return was the equivalent of summarize (one row per id value), which then got recycled. Testing that theory appears to confirm it if I update my data frame like so:
df <- data.frame(id = c(rep(letters[1:3], each = 3), "d"),
value = 1:10)
I get an error when trying to use the df_ply_3 method with df_mean():
Error in `$<-.data.frame`(`*tmp*`, "mean", value = c(2, 5, 8, 10)) :
replacement has 4 rows, data has 10
So, the mini df passed to df_mean returns a df where mean is the result of taking the mean of the value vector (returns one value). So, my output was just a data.frame of three values, one per id group. I'm thinking the mutate way sort of "remembers" that it was passed a mini data frame, and then repeats the single output to match its length?
In any case, thanks for commenting on df_ply_5; indeed, if I remove the rep() bit and just return mean(x), it works great!
Recommended answer
You're mostly right. ddply indeed breaks your data down into mini data frames based on the grouper, and applies a function to each piece.
With ddply, all the work is done with data frames, so the .fun argument must take a (mini) data frame as input and return a data frame as output.
mutate and summarize are functions that fit this bill (they take and return data frames). You can view their individual help pages, or run them on a data frame outside of ddply to see this, e.g.
mutate(mtcars, mean.mpg = mean(mpg))
summarize(mtcars, mean.mpg = mean(mpg))
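To make the difference concrete, here is a small sketch that compares the shape of the two results. It assumes only plyr is attached, so that mutate and summarize resolve to the plyr versions; the names m and s are just for illustration.

```r
library(plyr)

# mutate keeps every row of the input and appends the new column
m <- mutate(mtcars, mean.mpg = mean(mpg))

# summarize collapses the input to only the columns you define
s <- summarize(mtcars, mean.mpg = mean(mpg))

dim(m)  # same number of rows as mtcars, one extra column
dim(s)  # a single row with a single column
```

Both results are data frames, which is exactly what ddply needs back from its .fun.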
If you don't use mutate or summarize, that is, you only use a custom function, then your function also needs to take a (mini) data frame as argument, and return a data frame.
If you do use mutate or summarize, any other arguments you pass to ddply aren't used by ddply itself; they're just passed on to be used by mutate or summarize. And the functions used inside mutate and summarize act on the columns of the data, not on the entire data.frame. This is why the call looks like this:
ddply(mtcars, "cyl", mutate, mean.mpg = mean(mpg))
Notice that we don't pass mutate a function. We don't say ddply(mtcars, "cyl", mutate, mean). We have to tell it what to take the mean of. In ?mutate, the description of ... is "named parameters giving definitions of new columns", not anything to do with functions. (Is mean() really different from any "custom function"? No.)
Thus it doesn't work with anonymous functions, or with bare functions at all. Pass it an expression! You can define a custom function beforehand and call it in that expression.
custom_function <- function(x) {mean(x + runif(length(x)))}
ddply(mtcars, "cyl", mutate, jittered.mean.mpg = custom_function(mpg))
ddply(mtcars, "cyl", summarize, jittered.mean.mpg = custom_function(mpg))
This extends well: you can have functions that take multiple arguments, and you can give them different columns as arguments. But if you're using mutate or summarize, you have to give those functions their arguments; you're not just passing the functions.
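As a rough illustration of the multi-argument case, the sketch below defines a hypothetical helper, wt_mean (not part of plyr), and calls it inside summarize with two different columns of mtcars. Weighting mpg by wt has no particular meaning here; it is only there to show two columns being handed to one function.

```r
library(plyr)

# hypothetical helper: weighted mean of x using weights w
wt_mean <- function(x, w) {
  sum(x * w) / sum(w)
}

# the expression wt_mean(mpg, wt) is evaluated once per cyl group,
# with mpg and wt taken from that group's mini data frame
res <- ddply(mtcars, "cyl", summarize, wt.mean.mpg = wt_mean(mpg, wt))
res
```

The result should have one row per value of cyl, with the grouping column followed by the new column.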
You seem to want to pass ddply a function that already "knows" which column to take the mean of. For that, I think you'd need to not use mutate or summarize, but you can hack your own version. For summarize-like behavior, return a data.frame with a single value; for mutate-like behavior, return the original data.frame with your extra value cbind-ed on:
mean.mpg.mutate = function(df) {
cbind.data.frame(df, mean.mpg = mean(df$mpg))
}
mean.mpg.summarize = function(df) {
data.frame(mean.mpg = mean(df$mpg))
}
ddply(mtcars, "cyl", mean.mpg.mutate)
ddply(mtcars, "cyl", mean.mpg.summarize)
tl;dr
Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?
Quite the opposite! mutate and summarize take data frames as inputs and kick out data frames as returns. But mutate and summarize are the functions you're passing to ddply, not mean or whatever else.
mutate and summarize are convenience functions that you'll use 99% of the time you use ddply.
If you don't use mutate/summarize, then your function needs to take and return a data frame.
If you do use mutate/summarize, then you don't pass them functions; you pass them expressions that can be evaluated with your (mini) data frame. If it's mutate, the return should be a vector to be appended to the data (recycled as necessary). If it's summarize, the return should be a single value. You don't pass a function, like mean; you pass an expression, like mean(mpg).
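The "recycled as necessary" part can be checked with a short sketch, using the 10-row data set from the question's second update (group "d" has only one row, so the group sizes differ). The names per_row and per_group are only for illustration.

```r
library(plyr)

# the 10-row data set from the question's second update
df <- data.frame(id = c(rep(letters[1:3], each = 3), "d"),
                 value = 1:10)

# mean(value) is a single number per group; mutate recycles it
# down every row of that group's mini data frame
per_row <- ddply(df, .(id), mutate, mean = mean(value))

# summarize keeps just one row per group
per_group <- ddply(df, .(id), summarize, mean = mean(value))

nrow(per_row)    # one row per original row
nrow(per_group)  # one row per id
```

Because the recycling happens inside each piece, unequal group sizes are not a problem here, unlike the df_ply_3 approach of assigning the summarized column back onto df.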
This was written before dplyr was a thing, or at least a big thing. dplyr removes a lot of the confusion from this process because it essentially replaces the nesting of ddply (with mutate or summarize as arguments) with sequential functions: group_by followed by mutate or summarize. The dplyr version of my answer would be:
library(dplyr)
group_by(mtcars, cyl) %>%
mutate(mean.mpg = mean(mpg))
With the new column creation passed directly to mutate (or summarize), there isn't confusion about which function does what.
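The same "pass an expression that calls your function" idea carries over. A minimal sketch, assuming dplyr is installed and attached on its own (without plyr masking its verbs), with custom_mean as a hypothetical helper like the one in the question:

```r
library(dplyr)

# hypothetical helper, same idea as custom_mean() in the question
custom_mean <- function(x) {
  mean(x)
}

out <- mtcars %>%
  group_by(cyl) %>%
  mutate(mean.mpg = custom_mean(mpg)) %>%  # single value recycled within each group
  ungroup()

head(out)
```

Swapping mutate for summarize in the pipeline would give one row per cyl instead of one row per car.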