Use of ddply + mutate with a custom function?

Question

I use ddply quite frequently, but historically with summarize (occasionally mutate) and only basic functions like mean(), var1 - var2, etc. I have a dataset to which I'm trying to apply a custom, more involved function, and I started digging into how to do this with ddply. I've got a working solution, but I don't understand why it has to work this way when "normal" functions behave differently.

Related:

  • Custom Function not recognized by ddply {plyr}...
  • How do I pass variables to a custom function in ddply?
  • r-help: [R] Correct use of ddply with own function (I ended up basing my solution on this)

Here's an example dataset:

library(plyr)
df <- data.frame(id = rep(letters[1:3], each = 3),
                 value = 1:9)

Normally, I'd use ddply like so:

df_ply_1 <- ddply(df, .(id), mutate, mean = mean(value))

My mental picture of this is that ddply splits df into "mini" data frames based on the grouped combos of id, and then I add a new column by calling mean() on a column name that exists in df. So, my attempt to implement a function extended this idea:

# actually, my logical extension of the above was to use:
# ddply(..., mean = function(value) { mean(value) })
df_ply_2 <- ddply(df, .(id), mutate,
                  mean = function(df) { mean(df$value) })

Error: attempt to replicate an object of type 'closure'

None of the help I found on custom functions uses mutate, but that seems inconsistent, or at least annoying to me, as the analog to my implemented solution is:

df_mean <- function(df) {
    temp <- data.frame(mean = rep(mean(df$value), nrow(df)))
    temp
}

df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean

In-line, it looks like I have to do this:

df_ply_4 <- df
df_ply_4$mean <- ddply(df, .(id), function(x) {
    temp <- data.frame(mean = rep(mean(x$value), length(x$value)))
    temp})$mean

Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?

Thanks for helping me "get it"!

Update after @Gregor's answer

Awesome answer, and I think I now get it. I was, indeed, confused about what mutate and summarize meant: I thought they were arguments to ddply about how to handle the result, when they are actually functions themselves. So, thanks for that big insight.

Also, it really helped to understand that without mutate/summarize, I need to return a data.frame, which is why I have to cbind a named column onto the df that gets returned.

Lastly, if I do use mutate, it's helpful to now realize I can return a vector and get the right result. Thus, I can do this, which I now understand after reading your answer:

# I also caught that the code above doesn't do the right thing
# and recycles the single value returned by mean() vs. repeating it like
# I expected. Now that I know it's taking a vector, I know I need to return
# a vector the same length as my mini df
custom_mean <- function(x) {
    rep(mean(x), length(x))
}

df_ply_5 <- ddply(df, .(id), mutate,
              mean = custom_mean(value))

Thanks again for your in-depth answer!

Update based on @Gregor's latest comment

Hmmm. I used rep(mean(x), length(x)) because of what I observed in df_ply_3's result (I admit I didn't look at it closely when I first ran it while writing this post; I just saw that it didn't give me an error!):

df_mean <- function(x) {
    data.frame(mean = mean(x$value))
}

df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean

df_ply_3
  id value mean
1  a     1    2
2  a     2    5
3  a     3    8
4  b     4    2
5  b     5    5
6  b     6    8
7  c     7    2
8  c     8    5
9  c     9    8

So, I'm thinking that my code only worked by accident, because I had 3 id values each repeated 3 times. The actual return was the equivalent of summarize (one row per id value), and it got recycled. Testing that theory appears to confirm it if I update my data frame like so:

df <- data.frame(id = c(rep(letters[1:3], each = 3), "d"),
                 value = 1:10)

I get an error when trying to use the df_ply_3 method with df_mean():

Error in `$<-.data.frame`(`*tmp*`, "mean", value = c(2, 5, 8, 10)) : 
  replacement has 4 rows, data has 10

So, df_mean takes the mini df and returns a df where mean is the result of taking the mean of the value vector (a single value). My output was therefore just a data.frame of three values, one per id group. I'm thinking the mutate way sort of "remembers" that it was passed a mini data frame, and then repeats the single output to match its length?

In any case, thanks for commenting on df_ply_5; indeed, if I remove the rep() bit and just return mean(x), it works great!

Recommended answer

You're mostly right. ddply indeed breaks your data down into mini data frames based on the grouper, and applies a function to each piece.

With ddply, all the work is done with data frames, so the .fun argument must take a (mini) data frame as input and return a data frame as output.
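As a rough illustration of that contract, here is a base R sketch of the split-apply-combine pattern. It is only an analogy and not plyr's actual implementation; add_group_mean is a hypothetical helper name, and df is the example data from the question.

```r
# Example data from the question
df <- data.frame(id = rep(letters[1:3], each = 3),
                 value = 1:9)

# Hypothetical helper: takes a (mini) data frame, returns a data frame
add_group_mean <- function(piece) {
    piece$mean <- mean(piece$value)  # single value, recycled to nrow(piece)
    piece
}

pieces   <- split(df, df$id)              # one mini data frame per id
applied  <- lapply(pieces, add_group_mean) # apply the function to each piece
combined <- do.call(rbind, applied)        # stack the pieces back together
rownames(combined) <- NULL
combined
```

Any function with the same shape (data frame in, data frame out) can play the role of add_group_mean here, which is the same requirement ddply places on .fun.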

mutate and summarize are functions that fit this bill (they take and return data frames). You can view their individual help pages, or run them on a data frame outside of ddply to see this, e.g.

mutate(mtcars, mean.mpg = mean(mpg))
summarize(mtcars, mean.mpg = mean(mpg))

If you don't use mutate or summarize, that is, you only use a custom function, then your function also needs to take a (mini) data frame as its argument and return a data frame.

If you do use mutate or summarize, any other functions you pass to ddply aren't used by ddply; they're just passed on to be used by mutate or summarize. And functions used by mutate and summarize act on the columns of the data, not on the entire data.frame. This is why the call looks like this:

ddply(mtcars, "cyl", mutate, mean.mpg = mean(mpg))

Notice that we don't pass mutate a function. We don't say ddply(mtcars, "cyl", mutate, mean). We have to tell it what to take the mean of. In ?mutate, the description of ... is "named parameters giving definitions of new columns", not anything to do with functions. (Is mean() really different from any "custom function"? No.)

Thus it doesn't work with anonymous functions, or with functions at all. Pass it an expression! You can define a custom function beforehand.

custom_function <- function(x) {mean(x + runif(length(x)))}
ddply(mtcars, "cyl", mutate, jittered.mean.mpg = custom_function(mpg))
ddply(mtcars, "cyl", summarize, jittered.mean.mpg = custom_function(mpg))
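To make the function-versus-expression distinction concrete, here is a small sketch using the question's example data. It assumes plyr is installed; group_mean is a hypothetical name, and the exact wording of the error may vary between R versions, so the code only checks that an error occurs.

```r
library(plyr)

df <- data.frame(id = rep(letters[1:3], each = 3),
                 value = 1:9)

# Passing a function object as the column definition fails ...
bad <- tryCatch(
    ddply(df, .(id), mutate, mean = function(x) mean(x)),
    error = function(e) e
)
inherits(bad, "error")

# ... while passing an expression that calls a function works.
# group_mean is hypothetical and defined at top level so mutate can find it.
group_mean <- function(x) mean(x)
good <- ddply(df, .(id), mutate, mean = group_mean(value))
good
```

In the first call mutate tries to store the function itself as a column; in the second it evaluates group_mean(value) inside each mini data frame and stores the result.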

This extends well: you can have functions that take multiple arguments, and you can give them different columns as arguments. But if you're using mutate or summarize, you have to give those functions their arguments; you're not just passing the functions.
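A minimal sketch of the multi-argument case, assuming plyr is installed. weighted_avg is a hypothetical helper, and weighting mpg by wt is chosen purely for illustration.

```r
library(plyr)

# Hypothetical two-argument helper: mean of x weighted by w
weighted_avg <- function(x, w) sum(x * w) / sum(w)

# Each argument is supplied as a column of the (mini) data frame
res <- ddply(mtcars, "cyl", summarize,
             wt.mean.mpg = weighted_avg(mpg, wt))
res
```

The column names mpg and wt are resolved inside each cyl group, so the helper itself never needs to know about the data frame.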

You seem to want to pass ddply a function that already "knows" which column to take the mean of. For that, I think you'd need to not use mutate or summarize, but you can hack your own version. For summarize-like behavior, return a data.frame with a single value; for mutate-like behavior, return the original data.frame with your extra value cbinded on:

mean.mpg.mutate = function(df) {
    cbind.data.frame(df, mean.mpg = mean(df$mpg))
}

mean.mpg.summarize = function(df) {
    data.frame(mean.mpg = mean(df$mpg))
}

ddply(mtcars, "cyl", mean.mpg.mutate)
ddply(mtcars, "cyl", mean.mpg.summarize)

tl;dr

Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?

Quite the opposite! mutate and summarize take data frames as inputs and kick out data frames as returns. But mutate and summarize are the functions you're passing to ddply, not mean or whatever else.

mutate and summarize are convenience functions that you'll use 99% of the time you use ddply.

If you don't use mutate/summarize, then your function needs to take and return a data frame.

If you do use mutate/summarize, then you don't pass them functions; you pass them expressions that can be evaluated with your (mini) data frame. If it's mutate, the return should be a vector to be appended to the data (recycled as necessary). If it's summarize, the return should be a single value. You don't pass a function, like mean; you pass an expression, like mean(mpg).
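The difference in return shape can be checked on the question's updated data, where the groups have unequal sizes. This is a sketch that assumes plyr is installed.

```r
library(plyr)

# Updated example data from the question: group "d" has a single row
df <- data.frame(id = c(rep(letters[1:3], each = 3), "d"),
                 value = 1:10)

# mutate: the single mean is recycled to every row of its group
mutated <- ddply(df, .(id), mutate, mean = mean(value))

# summarize: one row per group
summarized <- ddply(df, .(id), summarize, mean = mean(value))

nrow(mutated)      # same number of rows as df
nrow(summarized)   # one row per id
```

Because the recycling happens inside each mini data frame, the mutate version does not depend on every group having the same size, unlike the df_ply_3 approach in the question.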

This was written before dplyr was a thing, or at least a big thing. dplyr removes a lot of the confusion from this process because, instead of nesting mutate or summarize as arguments inside ddply, it uses sequential functions: group_by followed by mutate or summarize. The dplyr version of my answer would be

library(dplyr)
group_by(mtcars, cyl) %>%
    mutate(mean.mpg = mean(mpg))

With the new column creation passed directly to mutate (or summarize), there isn't confusion about which function does what.
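The same pattern carries over to custom functions in dplyr. The following is a sketch that assumes dplyr is installed; center is a hypothetical helper used only for illustration.

```r
library(dplyr)

# Hypothetical custom function: deviation from the group mean
center <- function(x) x - mean(x)

by_cyl <- group_by(mtcars, cyl)

# mutate-like: one value per row, computed within each cyl group
centered <- by_cyl %>% mutate(mpg.centered = center(mpg))

# summarize-like: one row per cyl group
means <- by_cyl %>% summarize(mean.mpg = mean(mpg))
```

As with plyr, what gets passed is an expression such as center(mpg), not the function center itself.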
