如何完全通用地在 R 中的 data.table 中使用变量中的列名 [英] How can one work fully generically in data.table in R with column names in variables

查看:42
本文介绍了如何完全通用地在 R 中的 data.table 中使用变量中的列名的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

首先:感谢@MattDowle;data.table 是最好的东西之一自从我开始使用 R 以来,我就遇到过这种情况.

First of all: thanks to @MattDowle; data.table is among the best things that ever happened to me since I started using R.

第二:我知道针对可变列的各种用例的许多变通方法data.table 中的名称,包括:

Second: I am aware of many workarounds for various use cases of variable column names in data.table, including:

  1. 选择/分配给 data.table 变量的名称是存储在一个字符向量中
  2. 使用变量在 data.table 中传递列名在 R
  3. 通过保存的名称引用 data.table 列在变量中
  4. 以编程方式将列名传递给 data.table
  5. Data.table 元编程
  6. 怎么写调用调用data.table的函数的函数?
  7. 在`data.table`中使用动态列名
  8. data.table 中的动态列名,R
  9. 在 data.table 中使用 := 分配多列,按组
  10. 在分组依据"中设置列名操作data.table
  11. R 使用 data.table 汇总多列

可能还有更多我没有提到的.

and probably more I haven't referenced.

但是:即使我学会了上面记录的所有技巧,我从来不用查找它们来提醒自己如何使用它们,我仍然会发现使用作为参数传递给函数的列名是一项极其繁琐的工作.

But: even if I learned all the tricks documented above to the point that I never had to look them up to remind myself how to use them, I still would find that working with column names that are passed as parameters to a function is an extremely tedious task.

我正在寻找的是最佳实践批准"的替代方案到以下解决方法/工作流程.考虑我有一堆类似数据的列,并且想对这些列或它们的集合执行一系列类似的操作,其中操作具有任意高的复杂性,并且将列名组传递给在中指定的每个操作一个变量.

What I'm looking for is a "best-practices-approved" alternative to the following workaround / workflow. Consider that I have a bunch of columns of similar data, and would like to perform a sequence of similar operations on these columns or sets of them, where the operations are of arbitrarily high complexity, and the groups of column names passed to each operation specified in a variable.

我意识到这个问题听起来是人为的,但我以惊人的频率遇到它.这些示例通常非常混乱,以至于很难将与此问题相关的功能区分开来,但我最近偶然发现了一个相当简单的示例,可以在此处用作 MWE:

I realize this issue sounds contrived, but I run into it with surprising frequency. The examples are usually so messy that it is difficult to separate out the features relevant to this question, but I recently stumbled across one that was fairly straightforward to simplify for use as a MWE here:

library(data.table)
library(lubridate)
library(zoo)

the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
                var3=var1/floor(runif(6,2,5)))]

# Replicate data across months
new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                           length.out=12,
                                           by="1 month")),by=year]

# Do a complicated procedure to each variable in some group.
var.names <- c("var1","var2","var3")

for(varname in var.names) {
    #As suggested in an answer to Link 3 above
    #Convert the column name to a 'quote' object
    quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))

    #Do this for every column name I'll need
    varname <- quote.convert(varname)
    anntot <- quote.convert(paste0(varname,".annual.total"))
    monthly <- quote.convert(paste0(varname,".monthly"))
    rolling <- quote.convert(paste0(varname,".rolling"))
    scaled <- quote.convert(paste0(varname,".scaled"))

    #Perform the relevant tasks, using eval()
    #around every variable columnname I may want
    new.table[,eval(anntot):=
               the.table[,rep(eval(varname),each=12)]]
    new.table[,eval(monthly):=
               the.table[,rep(eval(varname)/12,each=12)]]
    new.table[,eval(rolling):=
               rollapply(eval(monthly),mean,width=12,
                         fill=c(head(eval(monthly),1),
                                tail(eval(monthly),1)))]
    new.table[,eval(scaled):=
               eval(anntot)/sum(eval(rolling))*eval(rolling),
              by=year]
}

当然,这里对数据和变量的特定影响是无关紧要的,所以请不要关注它或建议改进以完成它在这种特殊情况下的作用.相反,我正在寻找的是一种通用策略,用于将任意复杂的 data.table 操作过程重复应用于列列表或列列表列表的工作流,在变量或作为参数传递给函数,其中过程必须以编程方式引用变量/参数中命名的列,并且可能包括更新、连接、分组、对 data.table 特殊对象的调用.I.SD等;但是,它比上面的或其他需要频繁 quote-ing 和 eval-ing 的更简单、更优雅、更短或更容易设计、实现或理解.

Of course, the particular effect on the data and variables here is irrelevant, so please do not focus on it or suggest improvements to accomplishing what it accomplishes in this particular case. What I am looking for, rather, is a generic strategy for the workflow of repeatedly applying an arbitrarily complicated procedure of data.table actions to a list of columns or list of lists-of-columns, specified in a variable or passed as an argument to a function, where the procedure must refer programmatically to columns named in the variable/argument, and possibly includes updates, joins, groupings, calls to the data.table special objects .I, .SD, etc.; BUT one which is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent quote-ing and eval-ing.

请特别注意,由于过程可能相当复杂,并且涉及重复更新 data.table 然后引用更新的列,因此标准的 lapply(.SD,...), ... .SDcols = ... 方法通常不是可行的替代方法.也用 DT[[a.column.name]] 替换 eval(a.column.name) 的每次调用既不会简化太多,也不会完全正常工作,因为那不会据我所知,t 与其他 data.table 操作配合得很好.

In particular please note that because the procedures can be fairly complex and involve repeatedly updating the data.table and then referencing the updated columns, the standard lapply(.SD,...), ... .SDcols = ... approach is usually not a workable substitute. Also replacing each call of eval(a.column.name) with DT[[a.column.name]] neither simplifies much nor works completely in general since that doesn't play nice with the other data.table operations, as far as I am aware.

推荐答案

您所描述的问题与 data.table 没有严格的关系.
复杂的查询无法轻松转换为机器可以解析的代码,因此我们无法在为复杂操作编写查询时避免复杂性.
您可以尝试想象如何使用 dplyrSQL 以编程方式为以下 data.table 查询构造查询:

Problem you are describing is not strictly related to data.table.
Complex queries cannot be easily translated to code that machine can parse, thus we are not able to escape complexity in writing a query for complex operations.
You can try to imagine how to programmatically construct a query for the following data.table query using dplyr or SQL:

DT[, c(f1(v1, v2, opt=TRUE),
       f2(v3, v4, v5, opt1=FALSE, opt2=TRUE),
       lapply(.SD, f3, opt1=TRUE, opt2=FALSE))
   , by=.(id1, id2)]

假设所有列(id1id2v1...v5)甚至选项(opt, opt1, opt2) 应该作为变量传递.

Assuming that all columns (id1, id2, v1...v5) or even options (opt, opt1, opt2) should be passed as variables.

由于查询表达的复杂性,我认为您无法轻松完成问题中所述的要求:

Because of complexity in expression of queries I don't think you could easily accomplish requirement stated in your question:

比上面的或其他需要频繁 quote-ing 和 eval-ing 的更简单、更优雅、更短或更容易设计、实现或理解.

is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent quote-ing and eval-ing.

尽管与其他编程语言相比,基础 R 提供了非常有用的工具来处理此类问题.

Although, comparing to other programming languages, base R provides very useful tools to deal with such problems.

您已经找到了使用 getmgetDT[[col_name]]parse 的建议,quote, eval.

You already found suggestions to use get, mget, DT[[col_name]], parse, quote, eval.

  • As you mentioned DT[[col_name]] might not play well with data.table optimizations, thus is not that useful here.
  • parse is probably the easiest way to construct complex queries as you can just operate on strings, but it doesn't provide basic language syntax validation. So you can ended up trying to parse a string that R parser does not accept. Additionally there is a security concern as presented in 2655#issuecomment-376781159.
  • get/mget are the ones most commonly suggested to deal with such problems. get and mget are internally catch by [.data.table and translated to expected columns. So you are assuming your arbitrary complex query will be able to be decomposed by [.data.table and expected columns properly inputted.
  • Since you asked this question few years back, the new feature - dot-dot prefix - is being rolled out in recently. You prefix variable name using dot-dot to refer to a variable outside of the scope of current data.table. Similarly as you refer parent directory in file system. Internals behind dot-dot will be quite similar to get, variables having prefix will be de-referenced inside of [.data.table. . In future releases dot-dot prefix may allow calls like:
col1="a"; col2="b"; col3="g"; col4="x"; col5="y"
DT[..col4==..col5, .(s1=sum(..col1), s2=sum(..col2)), by=..col3]

  • 我个人更喜欢 quoteeval 而不是.quoteeval 几乎被解释为从头开始手写.此方法不依赖 data.table 功能来管理对列的引用.我们可以期望所有优化的工作方式与您手动编写这些查询的方式相同.我发现调试也更容易,因为在任何时候你都可以打印带引号的表达式来查看实际传递给 data.table 查询的内容.此外,发生错误的空间更小.使用 R 语言对象构建复杂的查询有时很棘手,很容易将过程包装到函数中,因此它可以应用于不同的用例并易于重用.需要注意的是,此方法独立于 data.table.它使用 R 语言结构.您可以在官方 R 语言定义中找到更多信息语言计算一章.

    • Personally I prefer quote and eval instead. quote and eval is interpreted almost as written by hand from scratch. This method does not rely on data.table abilities to manage references to columns. We can expect all optimizations to work the same way as if you would write those queries by hand. I found it also easier to debug as at any point you can just print quoted expression to see what is actually passed to data.table query. Additionally there is a less space for bugs to occur. Constructing complex queries using R language object is sometimes tricky, it is easy to wrap the procedure into function so it can be applied in different use cases and easily re-used. Important to note that this method is independent from data.table. It uses R language constructs. You can find more information about that in official R Language Definition in Computing on the language chapter.

      还有什么?

      • 我在 中提交了一个名为 macro 的新概念的提案#1579简而言之,它是 DT[eval(qi), eval(qj), eval(qby)] 的包装器,因此您仍然需要对 R 语言对象进行操作.欢迎您在此处发表评论.
      • 最近我在 PR#4304 中提出了另一种元编程接口方法.简而言之,它使用新参数 env 将基本的 R substitute 功能插入到 [.data.table 中.
      • I submitted proposal of a new concept called macro in #1579. In short it is a wrapper on DT[eval(qi), eval(qj), eval(qby)] so you still have to operate on R language objects. You are welcome to put your comment there.
      • Recently I proposed another approach for metaprogramming interface in PR#4304. In short it plugs base R substitute functionality into [.data.table using new argument env.

      进入示例.下面我将展示两种方法来解决它.第一个将使用基本 R 元编程,第二个将使用元编程用于 PR 中提出的 data.table#4304(见上文).

      Going to the example. Below I will show two ways to solve it. First one will use base R metaprogramming, second one will use metaprogramming for data.table proposed in PR#4304 (see above).

      • 基于语言的 R 计算

      我会把所有的逻辑都封装到 do_vars 函数中.调用 do_vars(donot=TRUE) 将打印要在 data.table 上计算的表达式,而不是 eval 它们.下面的代码应该在 OP 代码之后运行.

      I will wrap all logic into do_vars function. Calling do_vars(donot=TRUE) will print expressions to be computed on data.table instead of eval them. Below code should be run just after the OP code.

      expected = copy(new.table)
      new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]
      
      do_vars = function(x, y, vars, donot=FALSE) {
        name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
        do_var = function(var, x, y) {
          substitute({
            x[, .anntot := y[, rep(.var, each=12)]]
            x[, .monthly := y[, rep(.var/12, each=12)]]
            x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1)))]
            x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year]
          }, list(
            .var=as.name(var),
            .anntot=name.suffix(var, "annual.total"),
            .monthly=name.suffix(var, "monthly"),
            .rolling=name.suffix(var, "rolling"),
            .scaled=name.suffix(var, "scaled")
          ))
        }
        ql = lapply(setNames(nm=vars), do_var, x, y)
        if (donot) return(ql)
        lapply(ql, eval.parent)
        invisible(x)
      }
      do_vars(new.table, the.table, c("var1","var2","var3"))
      all.equal(expected, new.table)
      #[1] TRUE
      

      我们可以预览查询

      do_vars(new.table, the.table, c("var1","var2","var3"), donot=TRUE)
      #$var1
      #{
      #    x[, `:=`(var1.annual.total, y[, rep(var1, each = 12)])]
      #    x[, `:=`(var1.monthly, y[, rep(var1/12, each = 12)])]
      #    x[, `:=`(var1.rolling, rollapply(var1.monthly, mean, width = 12, 
      #        fill = c(head(var1.monthly, 1), tail(var1.monthly, 1))))]
      #    x[, `:=`(var1.scaled, var1.annual.total/sum(var1.rolling) * 
      #        var1.rolling), by = year]
      #}
      #
      #$var2
      #{
      #    x[, `:=`(var2.annual.total, y[, rep(var2, each = 12)])]
      #    x[, `:=`(var2.monthly, y[, rep(var2/12, each = 12)])]
      #    x[, `:=`(var2.rolling, rollapply(var2.monthly, mean, width = 12, 
      #        fill = c(head(var2.monthly, 1), tail(var2.monthly, 1))))]
      #    x[, `:=`(var2.scaled, var2.annual.total/sum(var2.rolling) * 
      #        var2.rolling), by = year]
      #}
      #
      #$var3
      #{
      #    x[, `:=`(var3.annual.total, y[, rep(var3, each = 12)])]
      #    x[, `:=`(var3.monthly, y[, rep(var3/12, each = 12)])]
      #    x[, `:=`(var3.rolling, rollapply(var3.monthly, mean, width = 12, 
      #        fill = c(head(var3.monthly, 1), tail(var3.monthly, 1))))]
      #    x[, `:=`(var3.scaled, var3.annual.total/sum(var3.rolling) * 
      #        var3.rolling), by = year]
      #}
      #
      

      • 提议的数据表元编程
      • expected = copy(new.table)
        new.table = the.table[, list(asofdate=seq(from=ymd((year)*10^4+101), length.out=12, by="1 month")), by=year]
        
        name.suffix = function(x, suffix) as.name(paste(x, suffix, sep="."))
        do_var2 = function(var, x, y) {
          x[, .anntot := y[, rep(.var, each=12)],
            env = list(
              .anntot = name.suffix(var, "annual.total"),
              .var = var
            )]
          x[, .monthly := y[, rep(.var/12, each=12)],
            env = list(
              .monthly = name.suffix(var, "monthly"),
              .var = var
            )]
          x[, .rolling := rollapply(.monthly, mean, width=12, fill=c(head(.monthly,1), tail(.monthly,1))),
            env = list(
              .rolling = name.suffix(var, "rolling"),
              .monthly = name.suffix(var, "monthly")
            )]
          x[, .scaled := .anntot/sum(.rolling)*.rolling, by=year,
            env = list(
              .scaled = name.suffix(var, "scaled"),
              .anntot = name.suffix(var, "annual.total"),
              .rolling = name.suffix(var, "rolling")
            )]
          TRUE
        }
        
        sapply(setNames(nm=var.names), do_var2, new.table, the.table)
        #var1 var2 var3 
        #TRUE TRUE TRUE 
        all.equal(expected, new.table)
        #[1] TRUE
        


        数据和更新的 OP 代码


        Data and updated OP code

        library(data.table)
        library(lubridate)
        library(zoo)
        
        the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
        the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
                        var3=var1/floor(runif(6,2,5)))]
        
        # Replicate data across months
        new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                                   length.out=12,
                                                   by="1 month")),by=year]
        
        # Do a complicated procedure to each variable in some group.
        var.names <- c("var1","var2","var3")
        
        for(varname in var.names) {
          #As suggested in an answer to Link 3 above
          #Convert the column name to a 'quote' object
          quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))
          
          #Do this for every column name I'll need
          varname <- quote.convert(varname)
          anntot <- quote.convert(paste0(varname,".annual.total"))
          monthly <- quote.convert(paste0(varname,".monthly"))
          rolling <- quote.convert(paste0(varname,".rolling"))
          scaled <- quote.convert(paste0(varname,".scaled"))
          
          #Perform the relevant tasks, using eval()
          #around every variable columnname I may want
          new.table[,paste0(varname,".annual.total"):=
                      the.table[,rep(eval(varname),each=12)]]
          new.table[,paste0(varname,".monthly"):=
                      the.table[,rep(eval(varname)/12,each=12)]]
          new.table[,paste0(varname,".rolling"):=
                      rollapply(eval(monthly),mean,width=12,
                                fill=c(head(eval(monthly),1),
                                       tail(eval(monthly),1)))]
          new.table[,paste0(varname,".scaled"):=
                      eval(anntot)/sum(eval(rolling))*eval(rolling),
                    by=year]
        }
        

        这篇关于如何完全通用地在 R 中的 data.table 中使用变量中的列名的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆