How can I use variable names to refer to data frame columns with ddply?
Problem description
I am trying to write a function that takes as arguments the name of a data frame holding time series data and the name of a column in that data frame. The function performs various manipulations on that data, one of which is adding a running total for each year in a column. I am using plyr.
When I use the name of the column directly with ddply and cumsum I have no problems:
library(plyr)

# Five years of monthly sales figures
df <- data.frame(date = seq(as.Date("2007/1/1"),
                            by = "month",
                            length.out = 60),
                 sales = runif(60, min = 700, max = 1200))
df$year <- as.numeric(format(as.Date(df$date), format = "%Y"))

# Running total of sales within each year
df <- ddply(df, .(year), transform,
            cum_sales = (cumsum(as.numeric(sales))))
This is all well and good but the ultimate aim is to be able to pass a column name to this function. When I try to use a variable in place of the column name, it doesn't work as I expected:
mycol <- "sales"
df[mycol]
df <- ddply(df, .(year), transform,
            cum_value2 = cumsum(as.numeric(df[mycol])))
I thought I knew how to access columns by name. This worries me because it suggests that I have failed to understand something basic about indexing and extraction. I would have thought that referring to columns by name in this way would be a common need.
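I have two questions.
- What am I doing wrong, i.e. what have I misunderstood?
- Is there a better way to implement this, bearing in mind that the function will not know the column name in advance?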
TIA
Recommended answer
The arguments to ddply are expressions that are evaluated in the context of each piece the original data frame is split into. Your df[mycol] addresses the whole data frame, so you cannot pass it as-is (btw, why do you need those as.numeric() calls? sales is already numeric, so they are completely useless).
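To make the point above concrete, here is a minimal sketch on a small made-up data frame (the six-row `df` and the names `piece`, `n_piece` and `n_whole` are illustrative assumptions, not part of the original question). It shows that single-bracket indexing returns a one-column data frame rather than a vector, and that `df` inside the function passed to ddply still refers to the whole data frame, not to the current piece.

```r
library(plyr)

# Small illustrative data set (assumed for this sketch)
df <- data.frame(year = rep(2007:2008, each = 3),
                 sales = c(10, 20, 30, 1, 2, 3))
mycol <- "sales"

# Single brackets keep the data frame structure; double brackets extract the vector
class(df[mycol])    # "data.frame"
class(df[[mycol]])  # "numeric"

# Each piece has 3 rows, but `df` looked up from inside the function
# is still the full 6-row data frame
pieces <- ddply(df, .(year), function(piece) {
  data.frame(n_piece = nrow(piece), n_whole = nrow(df))
})
pieces
```

So to work per group with a column name held in a variable, index the piece that ddply hands to your function (for example `piece[[mycol]]`) rather than `df`.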
The easiest way is to write your own function that does everything inside, and pass the column name down to it, e.g.
df <- ddply(df,
            .(year),
            .fun = function(x, colname) transform(x, cum_sales = cumsum(x[, colname])),
            colname = "sales")
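Since the question asks for a reusable function, the same idea can be wrapped up roughly as follows. This is only a sketch under some assumptions: `add_running_total()` and its arguments `value_col`, `group_col` and `out_col` are hypothetical names invented here (they are not part of plyr), the function takes the data frame itself rather than its name, and the grouping column is passed to ddply as a character vector.

```r
library(plyr)

# Hypothetical helper (not a plyr function): adds a per-group running total
# of `value_col` to `data`, storing it in a column named `out_col`.
add_running_total <- function(data, value_col, group_col = "year",
                              out_col = paste0("cum_", value_col)) {
  ddply(data, group_col, function(piece) {
    # [[ ]] extracts or assigns a column by the name held in a variable
    piece[[out_col]] <- cumsum(piece[[value_col]])
    piece
  })
}

# Small illustrative data set (assumed for this sketch)
df <- data.frame(year = rep(2007:2008, each = 3),
                 sales = c(10, 20, 30, 1, 2, 3))

result <- add_running_total(df, "sales")
result$cum_sales  # 10 30 60 1 3 6
```

Using `[[` for both reading and writing means the input and output column names can both come from variables, which `transform()` cannot do for the name of the new column.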