Use ddply within a function and include variable of interest as an argument


Question

I am relatively new to R, and trying to use ddply & summarise from the plyr package. This post almost, but not quite, answers my question. I could use some additional explanation/clarification.

My question:

I want to create a simple function to summarize descriptive statistics, by group, for a given variable. Unlike the linked post, I would like to include the variable of interest as an argument to the function. As has already been discussed on this site, this works:

require(plyr)

ddply(mtcars, ~ cyl, summarise,
  mean = mean(hp),
  sd   = sd(hp),
  min  = min(hp),
  max  = max(hp)
)

But this does not:

descriptives_by_group <- function(dataset, group, x)
{
  ddply(dataset, ~ group, summarise,
    mean = mean(x),
    sd   = sd(x),
    min  = min(x),
    max  = max(x)
  )
}

descriptives_by_group(mtcars, cyl, hp)

Because of the volume of data with which I am working, I would like to be able to have a function that allows me to specify the variable of interest to me as well as the dataset and grouping variable.

I have tried to edit the various solutions found here to address my problem, but I don't understand the code well enough to do it successfully.

The original poster used the following example dataset:

a = c(1,2,3,4)
b = c(0,0,1,1)
c = c(5,6,7,8)
df = data.frame(a,b,c)
sv = c("b")

With the desired output:

  b Ave
1 0 1.5
2 1 3.5

And the solution endorsed by Hadley was:

myFunction <- function(x, y) {
  NewColName <- "a"
  # ddply forwards NewColName to .fun, where it arrives as the argument col
  z <- ddply(x, y, .fun = function(xx, col) {
    c(Ave = mean(xx[, col], na.rm = TRUE))
  }, NewColName)
  return(z)
}

Where myFunction(df, sv) returns the desired output.

I tried to break down the code piece-by-piece to see if, by getting a better understanding of the underlying mechanics, I could modify the code to include an argument to the function that would pass to what, in this example, is "NewColName" (the variable you want to get information about). But I am not having any success. My difficulty is that I do not understand what is happening with (xx[,col]). I know that mean(xx[,col]) should be taking the mean of the column with index col for the data frame xx. But I don't understand where the anonymous function is reading those values from.
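
One way to see where xx and col come from is to carry out the split-apply-combine steps by hand. The sketch below is a base R approximation, not plyr's actual implementation: it assumes ddply splits the data frame by the grouping column, calls the function once per piece with that piece as xx, and forwards any extra arguments (here the column name) to the function. The names pieces, summarise_piece, results, and combined are made up for illustration.

```r
# Sample data from the original post
df <- data.frame(a = c(1, 2, 3, 4), b = c(0, 0, 1, 1), c = c(5, 6, 7, 8))
sv <- "b"

# Step 1 (split): one smaller data frame per value of the grouping column
pieces <- split(df, df[[sv]])

# Step 2 (apply): the function receives each piece as xx;
# col is an extra argument forwarded unchanged on every call
summarise_piece <- function(xx, col) {
  c(Ave = mean(xx[, col], na.rm = TRUE))
}
results <- lapply(pieces, summarise_piece, col = "a")

# Step 3 (combine): stack the per-group results and attach the group labels
combined <- data.frame(
  b = as.numeric(names(results)),
  do.call(rbind, results),
  row.names = NULL
)
print(combined)
```

Read this way, xx[, col] is just ordinary column indexing on the subset of rows that belong to one group, and col holds whatever string was passed after the function.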

Could someone please help me parse this? I've wasted hours on a trivial task I could accomplish easily with very repetitive code and/or with subsetting, but I got hung up on trying to make my script more simple and elegant, and on understanding the "whys" of this problem and its solution(s).

PS I have looked into the describeBy function from the psych package, but as far as I can tell, it does not let you specify the variable(s) you want to return values for, and consequently does not solve my problem.

Recommended answer

I just moved a couple things around in the example function you gave and showed how to get more than one column back out. Does this do what you want?

myFunction2 <- function(x, y, col) {
  # xx is the subset of x for one group; col is looked up in myFunction2
  z <- ddply(x, y, .fun = function(xx) {
    c(mean = mean(xx[, col], na.rm = TRUE),
      max  = max(xx[, col], na.rm = TRUE))
  })
  return(z)
}

myFunction2(mtcars, "cyl", "hp")
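
The same pattern can be extended to the four statistics from the question. This is a sketch rather than part of the original answer: it reuses the asker's function name descriptives_by_group, and it assumes the grouping variable and the variable of interest are both passed as quoted strings, as in the call above.

```r
library(plyr)

# Assumption: group and x are column names given as character strings
descriptives_by_group <- function(dataset, group, x) {
  ddply(dataset, group, .fun = function(xx) {
    # xx is the subset of dataset for one level of the grouping column
    c(mean = mean(xx[, x], na.rm = TRUE),
      sd   = sd(xx[, x], na.rm = TRUE),
      min  = min(xx[, x], na.rm = TRUE),
      max  = max(xx[, x], na.rm = TRUE))
  })
}

res <- descriptives_by_group(mtcars, "cyl", "hp")
print(res)
```

Passing the names as strings avoids the problem in the question's version, where the bare names group and x are not columns of the data frame.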
