How can one work fully generically in data.table in R with column names in variables

Problem description

First of all: thanks to @MattDowle; data.table is among the best things that ever happened to me since I started using R.

Second: I am aware of many workarounds for various use cases of variable column names in data.table, including:

  1. Variably selecting/assigning to fields in a data.table
  2. pass column name in data.table using variable in R
  3. Referring to data.table columns by names saved in variables
  4. passing column names to data.table programmatically
  5. Data.table meta-programming
  6. How to write a function that calls a function that calls data.table?
  7. Using dynamic column names in `data.table`
  8. dynamic column names in data.table, R
  9. assign multiple columns using := in data.table, by group
  10. Setting column name in "group by" operation with data.table
  11. R summarizing multiple columns with data.table

and probably more I haven't referenced.

But: even if I learned all the tricks documented above to the point that I never had to look them up to remind myself how to use them, I still would find that working with column names that are passed as parameters to a function is an extremely tedious task.

What I'm looking for is a "best-practices-approved" alternative to the following workaround / workflow. Consider that I have a bunch of columns of similar data, and would like to perform a sequence of similar operations on these columns or sets of them, where the operations are of arbitrarily high complexity, and the groups of column names passed to each operation are specified in a variable.

I realize this issue sounds contrived, but I run into it with surprising frequency. The examples are usually so messy that it is difficult to separate out the features relevant to this question, but I recently stumbled across one that was fairly straightforward to simplify for use as a MWE here:

library(data.table)
library(lubridate)
library(zoo)

the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
                var3=var1/floor(runif(6,2,5)))]

# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                           length.out=12,
                                           by="1 month")),by=year]

# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")

for(varname in var.names) {
    #As suggested in an answer to Link 3 above
    #Convert the column name to a 'quote' object
    quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))

    #Do this for every column name I'll need
    varname <- quote.convert(varname)
    anntot <- quote.convert(paste0(varname,".annual.total"))
    monthly <- quote.convert(paste0(varname,".monthly"))
    rolling <- quote.convert(paste0(varname,".rolling"))
    scaled <- quote.convert(paste0(varname,".scaled"))

    #Perform the relevant tasks, using eval()
    #around every variable columnname I may want
    new.table[,eval(anntot):=
               the.table[,rep(eval(varname),each=12)]]
    new.table[,eval(monthly):=
               the.table[,rep(eval(varname)/12,each=12)]]
    new.table[,eval(rolling):=
               rollapply(eval(monthly),mean,width=12,
                         fill=c(head(eval(monthly),1),
                                tail(eval(monthly),1)))]
    new.table[,eval(scaled):=
               eval(anntot)/sum(eval(rolling))*eval(rolling),
              by=year]
}

Of course, the particular effect on the data and variables here is irrelevant, so please do not focus on it or suggest improvements to accomplishing what it accomplishes in this particular case. What I am looking for, rather, is a generic strategy for the workflow of repeatedly applying an arbitrarily complicated procedure of data.table actions to a list of columns or list of lists-of-columns, specified in a variable or passed as an argument to a function, where the procedure must refer programmatically to columns named in the variable/argument, and possibly includes updates, joins, groupings, calls to the data.table special objects .I, .SD, etc.; BUT one which is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent quote-ing and eval-ing.

In particular please note that because the procedures can be fairly complex and involve repeatedly updating the data.table and then referencing the updated columns, the standard lapply(.SD,...), ... .SDcols = ... approach is usually not a workable substitute. Also replacing each call of eval(a.column.name) with DT[[a.column.name]] neither simplifies much nor works completely in general since that doesn't play nice with the other data.table operations, as far as I am aware.
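For comparison, here is a minimal sketch of the same kind of loop with the column names kept as plain character strings, using `(name) :=` to assign and `get(name)` to read instead of `quote`/`eval`. It is only an illustration under stated assumptions: the loop variable `v`, the `month` column and the `.share` column are hypothetical stand-ins for the more involved steps above, and `get()` can resolve to the wrong object if a column happens to share the name of one of the variables.

```r
library(data.table)

set.seed(27)
the.table <- data.table(year = 1991:1996,
                        var1 = floor(runif(6, 400, 1400)))
the.table[, var2 := var1 / floor(runif(6, 2, 5))]

# One row per month of each year (hypothetical simplification of asofdate)
new.table <- the.table[, list(month = 1:12), by = year]

var.names <- c("var1", "var2")

for (v in var.names) {
  # Derived column names stay ordinary character strings
  anntot  <- paste0(v, ".annual.total")
  monthly <- paste0(v, ".monthly")
  share   <- paste0(v, ".share")

  # (name) := value assigns to the column whose name is stored in `name`
  new.table[, (anntot) := rep(the.table[[v]], each = 12)]

  # get(name) reads the column whose name is stored in `name`
  new.table[, (monthly) := get(anntot) / 12]

  # The same two idioms combine with grouping
  new.table[, (share) := get(monthly) / sum(get(monthly)), by = year]
}
```

Each step can refer to columns created by the previous one, which is the part that a single `lapply(.SD, ...)` call does not cover.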

Solution

I tried to do this in data.table thinking "this isn't so bad"... but after an embarrassing length of time, I gave up. Matt says something like 'do in pieces then join', but I couldn't figure out elegant ways to do these pieces, especially because the last one depends on previous steps.

I have to say, this is a pretty brilliantly constructed question, and I too encounter similar issues frequently. I love data.table, but I still struggle sometimes. I don't know if I'm struggling with data.table or the complexity of the problem.

Here is the incomplete approach I've taken.

Realistically I can imagine that in a normal process you would have more intermediate variables stored that would be useful for calculating these values.

library(data.table)
library(zoo)

## Example yearly data
set.seed(27)
DT <- data.table(year=1991:1996,
                 var1=floor(runif(6,400,1400)))
DT[ , var2 := var1 / floor(runif(6,2,5))]
DT[ , var3 := var1 / floor(runif(6,2,5))]
setkeyv(DT,colnames(DT)[1])
DT

## Convenience function
nonkey <- function(dt){colnames(dt)[!colnames(dt)%in%key(dt)]}

## Annual data expressed monthly
NewDT <- DT[, j=list(asofdate=as.IDate(paste(year, 1:12, 1, sep="-"))), by=year]
setkeyv(NewDT, colnames(NewDT)[1:2])

## Create annual data
NewDT_Annual <- NewDT[DT]
setnames(NewDT_Annual, 
         nonkey(NewDT_Annual), 
         paste0(nonkey(NewDT_Annual), ".annual.total"))

## Compute monthly data
NewDT_Monthly <- NewDT[DT[ , .SD / 12, keyby=list(year)]]
setnames(NewDT_Monthly, 
         nonkey(NewDT_Monthly), 
         paste0(nonkey(NewDT_Monthly), ".monthly"))

## Compute rolling stats
NewDT_roll <- NewDT_Monthly[j = lapply(.SD, rollapply, mean, width=12, 
                                       fill=c(.SD[1],tail(.SD, 1))),
                            .SDcols=nonkey(NewDT_Monthly)]
NewDT_roll <- cbind(NewDT_Monthly[,1:2,with=FALSE], NewDT_roll)
setkeyv(NewDT_roll, colnames(NewDT_roll)[1:2])
setnames(NewDT_roll, 
         nonkey(NewDT_roll), 
         gsub(".monthly$",".rolling",nonkey(NewDT_roll)))

## Compute normalized values

## Compute "adjustment" table which is 
## total of each variable, by year for rolling
## divided by
## original annual totals

## merge "adjustment values" in with monthly data, and then 
## make a modified data.table which is each variable * annual adjustment factor

## Merge everything
NewDT_Combined <- NewDT_Annual[NewDT_roll][NewDT_Monthly]
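The normalization step left as comments above might be sketched roughly as follows. This is not the completed answer, only one possible shape for it, and it rests on assumptions: the `combined` table is a small hypothetical stand-in for `NewDT_Combined`, the scaling follows the question's formula (annual total divided by the year's sum of rolling values, times the rolling value), and it uses a single grouped update rather than a separate adjustment table and merge. Column names are built as character vectors and looked up with `.SD[[name]]`.

```r
library(data.table)

# Hypothetical stand-in for the combined table: two years of monthly rows
combined <- data.table(
  year              = rep(1991:1992, each = 12),
  var1.annual.total = rep(c(1200, 2400), each = 12),
  var1.rolling      = rep(c(90, 210), each = 12) + 1:12,
  var2.annual.total = rep(c(600, 800), each = 12),
  var2.rolling      = rep(c(40, 70), each = 12) + 1:12
)

vars      <- c("var1", "var2")
tot_cols  <- paste0(vars, ".annual.total")
roll_cols <- paste0(vars, ".rolling")
out_cols  <- paste0(vars, ".scaled")

# For each variable and each year, rescale the rolling values so that
# they sum to the annual total
combined[, (out_cols) := Map(
  function(tot, roll) .SD[[tot]] / sum(.SD[[roll]]) * .SD[[roll]],
  tot_cols, roll_cols
), by = year, .SDcols = c(tot_cols, roll_cols)]
```

Because the scaling factor is computed within each `year` group, the `.scaled` columns sum back to the corresponding annual totals, which is a convenient sanity check.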
