Writing functions (procedures) for data.table objects

Question

In the book Software for Data Analysis: Programming with R, John Chambers emphasizes that functions should generally not be written for their side effects; rather, a function should return a value without modifying any variables in its calling environment. By contrast, good scripts that use data.table objects specifically avoid object assignment with <-, which is typically used to store the result of a function.

First, a technical question. Imagine an R function called proc1 that accepts a data.table object x as its argument (possibly along with other parameters). proc1 returns NULL but modifies x using :=. As I understand it, calling proc1(x=x1) makes a copy of x1 because of the way promises work. However, as demonstrated below, the original object x1 is still modified by proc1. Why, and how, does this happen?

> require(data.table)
> x1 <- CJ(1:2, 2:3)
> x1
   V1 V2
1:  1  2
2:  1  3
3:  2  2
4:  2  3
> proc1 <- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
> proc1(x1)
NULL
> x1
   V1 V2 y
1:  1  2 2
2:  1  3 3
3:  2  2 4
4:  2  3 6
> 

Furthermore, calling proc1(x=x1) does not seem to be any slower than running the operation directly on x1, which suggests that my vague understanding of promises is wrong and that they work in a pass-by-reference sort of way:

> x1 <- CJ(1:2000, 1:500)
> x1[, paste0("V",3:300) := rnorm(nrow(x1))]
> proc1 <- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
> system.time(proc1(x1))
   user  system elapsed 
   0.00    0.02    0.02 
> x1 <- CJ(1:2000, 1:500)
> system.time(x1[,y:= V1*V2])
   user  system elapsed 
   0.03    0.00    0.03 

So, given that passing a data.table argument to a function adds no measurable time, it is possible to write procedures for data.table objects that combine the speed of data.table with the generality of a function. However, given John Chambers's point that functions should not have side effects, is it really "ok" to write this kind of procedural code in R? Why does he argue that side effects are "bad"? If I am going to ignore his advice, what pitfalls should I be aware of? What can I do to write "good" data.table procedures?

Recommended answer

Yes, adding, modifying, and deleting columns in a data.table is done by reference. In one sense this is a good thing: a data.table usually holds a lot of data, and reallocating all of it every time a change is made would cost a great deal of memory and time. In another sense it is a bad thing, because it goes against the no-side-effect functional programming approach that R tries to promote by using pass-by-value by default. With no-side-effect programming, there is little to worry about when you call a function: you can rest assured that your inputs and your environment won't be affected, and you can focus on the function's output alone. It's simple, hence comfortable.
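To make the by-reference behaviour concrete, here is a minimal sketch. The helper names same_object and with_product are hypothetical (they are not part of data.table or of the question); address() and copy() are data.table functions. The first helper checks that a function argument refers to the caller's object rather than to a copy, and the second shows one way to keep a function free of side effects by working on an explicit copy.

```r
library(data.table)

dt <- data.table(V1 = c(1L, 1L, 2L, 2L), V2 = c(2L, 3L, 2L, 3L))

# Hypothetical helper: compares the memory address of the argument with an
# address recorded by the caller, to see whether the argument was copied.
same_object <- function(x, addr) address(x) == addr
same_object(dt, address(dt))

# Hypothetical side-effect-free variant of proc1: copy() makes a deep copy,
# so := only touches the local copy and the caller's table is left alone.
with_product <- function(x) {
  x <- copy(x)
  x[, y := V1 * V2]
  x[]              # trailing [] makes the returned table print normally
}

res <- with_product(dt)
"y" %in% names(dt)    # FALSE: the original table was not modified
"y" %in% names(res)   # TRUE: the new column lives in the returned copy
```

The copy costs memory and time proportional to the size of the table, which is exactly the trade-off described above, so this pattern is probably best reserved for cases where the caller's table must stay untouched.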

Of course it is ok to disregard John Chambers's advice if you know what you are doing. As for writing "good" data.table procedures, here are a couple of rules I would consider if I were you, as a way to limit complexity and the number of side effects:

  • a function should not modify more than one table, i.e., modifying that table should be its only side effect,
  • if a function modifies a table, then make that table the output of the function. Of course, you won't want to re-assign it: just run do.something.to(table) and not table <- do.something.to(table). If the function instead had another ("real") output, then when calling result <- do.something.to(table) it is easy to imagine focusing your attention on the output and forgetting that the call also had a side effect on your table.
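The two rules above could look like this in practice. This is only a sketch: add_product is a hypothetical stand-in for do.something.to, and the column names V1, V2 and y are taken from the question's example.

```r
library(data.table)

# Hypothetical procedure following both rules: it modifies exactly one table
# (its argument) and returns that same table as its only output.
add_product <- function(x) {
  stopifnot(is.data.table(x))
  x[, y := V1 * V2]   # the single side effect: a new column added by reference
  invisible(x)        # return the modified table without printing it
}

dt <- data.table(V1 = c(1L, 1L, 2L, 2L), V2 = c(2L, 3L, 2L, 3L))

add_product(dt)       # no re-assignment needed
names(dt)             # "V1" "V2" "y"
```

Returning the table invisibly keeps the call quiet at the console while still allowing chaining, for example add_product(dt)[y > 2].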

While "one output / no side effect" functions are the norm in R, the above rules allow for "one output or one side effect". If you agree that a side effect is in some sense a form of output, then you'll agree that I am not bending the rules too much: this still sticks loosely to R's one-output functional programming style. Allowing functions to have multiple side effects would be more of a stretch; not that you can't do it, but I would try to avoid it if possible.
