R: using ddply in a loop over data frame columns

Question

I need to calculate and add to a data frame multiple new columns based on the values in each column in a subset of columns in the data frame. These columns all hold time series data (there is a common date column). For example I need to calculate the change for the same month in the previous year for a dozen columns. I could specify them and calculate them individually but that becomes onerous with a large number of columns to transform, so I am trying to automate the process with a for loop.

I was doing OK until I tried to use ddply to create a column for the running total of the value for the year so far. What happens is that ddply is adding new rows during each iteration through the loop and including those new rows in the cumsum calculation. I have two questions.

Q. How can I get ddply to calculate the correct cumsum?
Q. How can I specify the name of the column during the ddply call, rather than using a dummy value and renaming afterward?

[Edit: I spoke too soon, the updated code below does NOT work at this point, just FYI]

require(lubridate)
require(plyr)
require(xts)

set.seed(12345)
# create dummy time series data
monthsback <- 24
startdate <- as.Date(paste(year(now()),month(now()),"1",sep = "-")) - months(monthsback)
mydf <- data.frame(mydate = seq(as.Date(startdate), by = "month", length.out = monthsback),
                   myvalue1 = runif(monthsback, min = 600, max = 800),
                   myvalue2 = runif(monthsback, min = 200, max = 300))

mydf$year <- as.numeric(format(as.Date(mydf$mydate), format="%Y"))
mydf$month <- as.numeric(format(as.Date(mydf$mydate), format="%m"))
newcolnames <- c('myvalue1','myvalue2')

for (i in seq_along(newcolnames)) {
    print(newcolnames[i])
    mydf$myxts <- xts(mydf[, newcolnames[i]], order.by = mydf$mydate)
    ## Calculate change over same month in previous year
    mylag <- 12
    mydf[, paste(newcolnames[i], "_yoy", sep = "", collapse = "")] <- as.numeric(diff(mydf$myxts, lag = mylag)/ lag(mydf$myxts, mylag))
    ## Calculate change over previous month
    mylag <- 1
    mydf[, paste(newcolnames[i], "_mom", sep = "", collapse = "")] <- as.numeric(diff(mydf$myxts, lag = mylag)/ lag(mydf$myxts, mylag))

    ## Calculate cumulative figure
    #mydf$newcol <- as.numeric(mydf$myxts)
    mydf$newcol <- 1
    mydf <- ddply(mydf, .(year), transform, newcol = cumsum(as.numeric(mydf$myxts)))
    colnames(mydf)[colnames(mydf)=="newcol"] <- paste(newcolnames[i], "_cuml", sep = "", collapse = "")

}

mydf

Solution

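In your loop, myxts is not part of the data frame, so ddply does not split it by year along with everything else. Store it as a column of the data frame:

mydf$myxts <- xts(mydf[, newcolnames[i]], order.by = mydf$mydate)

The expression passed to transform also has to refer to that column by its bare name (myxts). Written as mydf$myxts, it still points at the full, unsplit column, so every year receives values from all rows.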
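To illustrate why the column has to live in the data frame and be referenced by its bare name, here is a minimal sketch. It assumes a plain numeric column in place of the xts object, and the data frame `toy` and its values are hypothetical, chosen only so the result is easy to check by hand.

```r
library(plyr)

# Toy data (hypothetical values): two years, three months each.
toy <- data.frame(
  year  = rep(c(2020, 2021), each = 3),
  month = rep(1:3, times = 2),
  value = c(10, 20, 30, 1, 2, 3)
)

# `value` is written as a bare name, so transform() looks it up in the
# per-year piece that ddply passes in, and the sum restarts each year.
toy_cuml <- ddply(toy, .(year), transform, value_cuml = cumsum(value))

print(toy_cuml)
```

Writing `cumsum(toy$value)` instead would bypass the split, because `toy` is looked up outside the piece and always holds every row.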

I don't know of any way to use dynamically generated names with transform.
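One possible way around the naming limitation is to skip transform for this step and assign the grouped running total directly under a name built with paste0. The sketch below uses base R's ave() rather than ddply, which is a swapped-in technique and not part of the original answer. The data frame `toy` and the column names `value1` and `value2` are hypothetical, and the rows are assumed to be in date order already.

```r
# Toy data (hypothetical values): two years, three months each.
toy <- data.frame(
  year   = rep(c(2020, 2021), each = 3),
  month  = rep(1:3, times = 2),
  value1 = c(10, 20, 30, 1, 2, 3),
  value2 = c(5, 5, 5, 2, 4, 6)
)

value_cols <- c("value1", "value2")

for (col in value_cols) {
  # Build the target column name dynamically, e.g. "value1_cuml".
  cuml_name <- paste0(col, "_cuml")
  # ave() applies FUN within each level of the grouping variable and
  # returns a vector aligned with the original rows.
  toy[[cuml_name]] <- ave(toy[[col]], toy$year, FUN = cumsum)
}

print(toy)
```

Because the result is assigned with `[[`, no dummy column or later rename is needed. If staying with ddply is preferred, passing an anonymous function that sets `piece[[cuml_name]]` and returns the piece should give the same result, though that variant is not shown here.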
