r数据表功能编程/元编程/计算语言 [英] r data.table functional programming / metaprogramming / computing on the language

查看:181
本文介绍了r数据表功能编程/元编程/计算语言的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在探索使用data.table包装聚合函数的不同方式(但实际上可以是任何类型的函数)(还提供了一个dplyr示例),并且想知道关于函数编程/元编程的最佳实践




  • 性能(关于数据表可能适用的潜在优化的实现问题)

  • 可读性(有一个普遍认可的标准,例如在大多数使用data.table的包中)

  • 易于泛化(元编程是泛化的区别)



基本应用程序是灵活地聚合表,即将变量参数化为聚合,要聚合的维度,各自生成的变量名称和聚合功能。我已经在三个data.table和一个dplyr方法中实现了(几乎)相同的功能:


  1. fn_dt_agg1(这里我无法弄清楚如何参数化聚合函数)

  2. fn_dt_agg2(灵感来自@jangorecki的答案这里

  3. fn_dt_agg3(灵感来自@Arun的答案这里,这似乎是元编程的另一种方法)

  4. fn_df_agg1(我在dplyr中的同样的方法)

 库(data.table)
库(dplyr)

数据

  n_size< 1 * 10 ^ 6 
sample_metrics< - sample(seq(from = 1,to = 100,by = 1),n_size, rep = T)
sample_dimensions< - sample(letters [10:12],n_size,rep = T)
df< -
data.frame(
a = sample_metrics,
b = sample_metrics,
c = sample_dimensions,
d = sample_dimensions,
x = sample_metrics,
y = sample_dimensions,
stringsAsFactors = F)

dt< - as.data.table(df)

实施



1。 fn_dt_agg1

  fn_dt_agg1<  -  
函数(dt,metric,metric_name,dimension,dimension_name){

temp < - dt [,setNames(lapply(.SD,function(x){sum(x,na.rm = T)}),
metric_name),
keyby = dimension,.SDcols = metric]
temp []
}

res_dt1 < -
fn_dt_agg1(
dt = dt,metric = c(a,b),metric_name = c(a,b),
dimension = c(c,d),dimension_name = c(c d))

2。 fn_dt_agg2

  fn_dt_agg2 < -  
function(dt,metric,metric_name,dimension,dimension_name,
agg_type){

j_call = as.call(c(
as.name(。),
sapply(setNames(metric,metric_name),
function(var)as.call(list(as.name(agg_type),
as.name(var),na.rm = T)),
simplified = F)


dt [,eval(j_call),keyby = dimension] []
}

res_dt2 < -
fn_dt_agg2(
dt = dt,metric = c(a,b),metric_name = c(a,b),
dimension = c(c,d),dimension_name = c(c,d),
agg_type = c(sum))

all.equal(res_dt1,res_dt2)
#TRUE

3。 fn_dt_agg3

  fn_dt_agg3<  -  
函数(dt,metric,metric_name,dimension,dimension_name,agg_type ){

e < - eval(parse(text = paste0(function(x){,
agg_type,(,x,na.rm = T)} ))

temp < - dt [,setNames(lapply(.SD,e),
metric_name),
keyby = dimension,.SDcols = metric]
temp []
}

res_dt3 < -
fn_dt_agg3(
dt = dt,metric = c(a,b),metric_name = c(a,b),
dimension = c(c,d),dimension_name = c(c,d),
agg_type =sum )

all.equal(res_dt1,res_dt3)
#TRUE

4。 fn_df_agg1

  fn_df_agg1<  -  
function(df,metric,metric_name,dimension,dimension_name,agg_type ){

all_vars< - c(维度,度量)
all_vars_new< - c(dimension_name,metric_name)
dots_group< - lapply(dimension,as.name)

e< - eval(parse(text = paste0(function(x){,
agg_type,(,x,na.rm = T)})) )

df%>%
select _(。dots = all_vars)%>%
group_by _(。dots = dots_group)%>%
summarise_each_乐趣(e),度量)%>%
重命名_(。dots = setNames(all_vars,all_vars_new))
}

res_df1 < -
fn_df_agg1
df = df,metric = c(a,b),metric_name = c(a,b),
dimension = c(c,d ,dimension_name = c(c,d),
agg_type =sum)

all.equal(res_dt1,as.data.table(res_df1))
#数据集有不同的键。'target':c,d。'current没有钥匙。

基准测试



出于好奇心,对于我未来的自我和其他感兴趣的各方,我运行了所有4个实现的基准,这可能已经揭示了性能问题(尽管我不是一个基准专家,所以请原谅,如果我没有应用共同商定的最佳做法)。我期待fn_dt_agg1是最快的,因为它有一个参数较少(聚合函数),但似乎没有相当大的影响。我也感到惊讶的是相对缓慢的dplyr功能,但这可能是由于我的结局设计选择不好。

 库microbenchmark)
bench_res < -
microbenchmark(
fn_dt_agg1 =
fn_dt_agg1(
dt = dt,metric = c(a,b),
metric_name = c(a,b),
dimension = c(c,d),
dimension_name = c(c,d)) ,
fn_dt_agg2 =
fn_dt_agg2(
dt = dt,metric = c(a,b),
metric_name = c(a,b) ,
dimension = c(c,d),
dimension_name = c(c,d),
agg_type = c(sum)),
fn_dt_agg3 =
fn_dt_agg3(
dt = dt,metric = c(a,b),
metric_name = c(a,b),
dimension = c(c,d),
dimension_name = c(c,d),
agg_type = c(sum)),
fn_df_agg1 =
fn_df_agg1(
df = df,metric = c(a,b) ,metric_name = c(a,b),
dimension = c(c,d),dimension_name = c(c,d),
agg_type = sum),
times = 100L)

bench_res

#单位:毫秒
#expr最小lq平均值uq max neval
#fn_dt_agg1 28.96324 30.49507 35.60988 32.62860 37.43578 140.32975 100
#fn_dt_agg2 27.51993 28.41329 31.80023 28.93523 33.17064 84.56375 100
#fn_dt_agg3 25.46765 26.04711 30.11860 26.64817 30.28980 153.09715 100
#fn_df_agg1 88.33516 90.23776 97.84826 94.28843 97.97154 172.87838 100

其他资源




解决方案

我不推荐 eval(parse())。您可以实现与方法三相同:没有它:

  fn_dt_agg4<  -  
函数(dt, metric_name,dimension,dimension_name,agg_type){

e < - function(x)getFunction(agg_type)(x,na.rm = T)

temp < [,setNames(lapply(.SD,e),
metric_name),
keyby = dimension,.SDcols = metric]
temp []
}

这也避免了一些安全隐患。



PS:你可以检查通过设置 options(datatable.verbose= TRUE)

I am exploring different ways to wrap an aggregation function (but really it could be any type of function) using data.table (one dplyr example is also provided) and was wondering on best practices for functional programming / metaprogramming with respect to

  • performance (does the implementation matter with respect to potential optimization that data.table may apply)
  • readability (is there a commonly agreed standard e.g. in most packages utilizing data.table)
  • ease of generalization (are there differences in the way metaprogramming is "generalizable")

The basic application is to aggregate a table flexibly, i.e. parameterizing the variables to aggregate, the dimensions to aggregate by, the respective resulting variable names of both and the aggregation function. I have implemented (nearly) the same function in three data.table and one dplyr way:

  1. fn_dt_agg1 (here I couldn't figure out how parameterize the aggregation function)
  2. fn_dt_agg2 (inspired by @jangorecki 's answer here which he calls "computing on the language")
  3. fn_dt_agg3 (inspired by @Arun 's answer here which seems to be another approach of metaprogramming)
  4. fn_df_agg1 (my humble approach of the same in dplyr)

libraries

library(data.table)
library(dplyr)

data

n_size <- 1*10^6
sample_metrics <- sample(seq(from = 1, to = 100, by = 1), n_size, rep = T)
sample_dimensions <- sample(letters[10:12], n_size, rep = T)
df <- 
  data.frame(
    a = sample_metrics,
    b = sample_metrics,
    c = sample_dimensions,
    d = sample_dimensions,
    x = sample_metrics,
    y = sample_dimensions,
    stringsAsFactors = F)

dt <- as.data.table(df)

implementations

1. fn_dt_agg1

fn_dt_agg1 <- 
  function(dt, metric, metric_name, dimension, dimension_name) {

  temp <- dt[, setNames(lapply(.SD, function(x) {sum(x, na.rm = T)}), 
                        metric_name), 
             keyby = dimension, .SDcols = metric]
  temp[]
}

res_dt1 <- 
  fn_dt_agg1(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"))

2. fn_dt_agg2

fn_dt_agg2 <- 
  function(dt, metric, metric_name, dimension, dimension_name,
           agg_type) {

  j_call = as.call(c(
    as.name("."),
    sapply(setNames(metric, metric_name), 
           function(var) as.call(list(as.name(agg_type), 
                                      as.name(var), na.rm = T)), 
           simplify = F)
    ))

  dt[, eval(j_call), keyby = dimension][]
}

res_dt2 <- 
  fn_dt_agg2(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"),
    agg_type = c("sum"))

all.equal(res_dt1, res_dt2)
#TRUE

3. fn_dt_agg3

fn_dt_agg3 <- 
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

  e <- eval(parse(text=paste0("function(x) {", 
                              agg_type, "(", "x, na.rm = T)}"))) 

  temp <- dt[, setNames(lapply(.SD, e), 
                        metric_name), 
             keyby = dimension, .SDcols = metric]
  temp[]
}

res_dt3 <- 
  fn_dt_agg3(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"), 
    agg_type = "sum")

all.equal(res_dt1, res_dt3)
#TRUE

4. fn_df_agg1

fn_df_agg1 <-
  function(df, metric, metric_name, dimension, dimension_name, agg_type) {

    all_vars <- c(dimension, metric)
    all_vars_new <- c(dimension_name, metric_name)
    dots_group <- lapply(dimension, as.name)

    e <- eval(parse(text=paste0("function(x) {", 
                                agg_type, "(", "x, na.rm = T)}")))

    df %>%
      select_(.dots = all_vars) %>%
      group_by_(.dots = dots_group) %>%
      summarise_each_(funs(e), metric) %>%
      rename_(.dots = setNames(all_vars, all_vars_new))
}

res_df1 <- 
  fn_df_agg1(
    df = df, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"),
    agg_type = "sum")

all.equal(res_dt1, as.data.table(res_df1))
#"Datasets has different keys. 'target': c, d. 'current' has no key."

benchmarking

Just out of curiosity and for my future self and other interested parties, I ran a benchmark of all 4 implementations which potentially already sheds light on the performance issue (although I'm not a benchmarking expert so please excuse if I haven't applied commonly agreed best practices). I was expecting fn_dt_agg1 to be the fastest as it has one parameter less (aggregation function) but that doesn't seem to have a sizable impact. I was also surprised by the relatively slow dplyr function but this may be due to a bad design choice on my end.

library(microbenchmark)
bench_res <- 
  microbenchmark(
    fn_dt_agg1 = 
      fn_dt_agg1(
      dt = dt, metric = c("a", "b"), 
      metric_name = c("a", "b"), 
      dimension = c("c", "d"), 
      dimension_name = c("c", "d")), 
    fn_dt_agg2 = 
      fn_dt_agg2(
        dt = dt, metric = c("a", "b"), 
        metric_name = c("a", "b"), 
        dimension = c("c", "d"), 
        dimension_name = c("c", "d"),
        agg_type = c("sum")),
    fn_dt_agg3 =
      fn_dt_agg3(
        dt = dt, metric = c("a", "b"), 
        metric_name = c("a", "b"),
        dimension = c("c", "d"), 
        dimension_name = c("c", "d"),
        agg_type = c("sum")),
    fn_df_agg1 =
      fn_df_agg1(
        df = df, metric = c("a", "b"), metric_name = c("a", "b"),
        dimension = c("c", "d"), dimension_name = c("c", "d"),
        agg_type = "sum"),
    times = 100L)

bench_res

# Unit: milliseconds
#       expr      min       lq     mean   median       uq       max neval
# fn_dt_agg1 28.96324 30.49507 35.60988 32.62860 37.43578 140.32975   100
# fn_dt_agg2 27.51993 28.41329 31.80023 28.93523 33.17064  84.56375   100
# fn_dt_agg3 25.46765 26.04711 30.11860 26.64817 30.28980 153.09715   100
# fn_df_agg1 88.33516 90.23776 97.84826 94.28843 97.97154 172.87838   100

other resources

解决方案

I don't recommend eval(parse()). You can achieve the same as in approach three without it:

fn_dt_agg4 <- 
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

    e <- function(x) getFunction(agg_type)(x, na.rm = T)

    temp <- dt[, setNames(lapply(.SD, e), 
                          metric_name), 
               keyby = dimension, .SDcols = metric]
    temp[]
  }

This also avoids some security risks.

PS: You can check what data.table is doing regarding optimizations by setting options("datatable.verbose" = TRUE).

这篇关于r数据表功能编程/元编程/计算语言的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆