Writing functions (procedures) for data.table objects


Problem description


In the book Software for Data Analysis: Programming with R, John Chambers emphasizes that functions should generally not be written for their side effects; rather, a function should return a value without modifying any variables in its calling environment. Conversely, good scripts using data.table objects specifically avoid object assignment with <-, which is typically used to store the result of a function.

First, a technical question. Imagine an R function called proc1 that accepts a data.table object x as its argument (in addition to, maybe, other parameters). proc1 returns NULL but modifies x using :=. From what I understand, calling proc1(x=x1) makes a copy of x1 just because of the way that promises work. However, as demonstrated below, the original object x1 is still modified by proc1. Why/how is this?

> require(data.table)
> x1 <- CJ(1:2, 2:3)
> x1
   V1 V2
1:  1  2
2:  1  3
3:  2  2
4:  2  3
> proc1 <- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
> proc1(x1)
NULL
> x1
   V1 V2 y
1:  1  2 2
2:  1  3 3
3:  2  2 4
4:  2  3 6
> 

Furthermore, it seems that using proc1(x=x1) isn't any slower than doing the procedure directly on x1, indicating that my vague understanding of promises is wrong and that they work in a pass-by-reference sort of way:

> x1 <- CJ(1:2000, 1:500)
> x1[, paste0("V",3:300) := rnorm(1:nrow(x1))]
> proc1 <- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
> system.time(proc1(x1))
   user  system elapsed 
   0.00    0.02    0.02 
> x1 <- CJ(1:2000, 1:500)
> system.time(x1[,y:= V1*V2])
   user  system elapsed 
   0.03    0.00    0.03 

So, given that passing a data.table argument to a function doesn't add time, that makes it possible to write procedures for data.table objects, incorporating both the speed of data.table and the generalizability of a function. However, given what John Chambers said, that functions should not have side-effects, is it really "ok" to write this type of procedural programming in R? Why was he arguing that side effects are "bad"? If I'm going to ignore his advice, what sort of pitfalls should I be aware of? What can I do to write "good" data.table procedures?

Solution

Yes, adding, modifying, and deleting columns in a data.table is done by reference. In a sense, this is a good thing, because a data.table usually holds a lot of data, and it would be very memory- and time-consuming to reallocate it all every time a change is made. On the other hand, it is a bad thing because it goes against the no-side-effect functional programming approach that R tries to promote by using pass-by-value semantics by default. With no-side-effect programming, there is little to worry about when you call a function: you can rest assured that your inputs and your environment won't be affected, and you can just focus on the function's output. It's simple, hence comfortable.
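To make the contrast concrete, here is a minimal sketch of the two styles side by side. The helper names add_y_in_place and add_y_on_copy are made up for illustration; the only data.table features assumed are := and copy(), which makes a deep copy so that later changes no longer reach the caller's table.

```r
library(data.table)

# Hypothetical helper: adds column y to its argument by reference.
# The caller's table changes even though nothing is returned.
add_y_in_place <- function(x) {
  x[, y := V1 * V2]
  invisible(NULL)
}

# Hypothetical helper: works on a deep copy, so the caller's table
# is left untouched and the result must be captured with <-.
add_y_on_copy <- function(x) {
  x <- copy(x)
  x[, y := V1 * V2]
  x
}

# Column names are given explicitly so the example does not rely on
# CJ's default naming.
x1 <- CJ(V1 = 1:2, V2 = 2:3)

x2 <- add_y_on_copy(x1)
"y" %in% names(x1)   # FALSE: x1 was not touched
"y" %in% names(x2)   # TRUE

add_y_in_place(x1)
"y" %in% names(x1)   # TRUE: x1 was modified by reference
```

The copy-based version behaves the way Chambers recommends, at the price of duplicating the whole table on every call, which is exactly the cost that := is designed to avoid.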

Of course, it is OK to disregard John Chambers's advice if you know what you are doing. As for writing "good" data.table procedures, here are a couple of rules I would consider if I were you, as a way to limit complexity and the number of side effects:

  • a function should not modify more than one table, i.e., modifying that table should be the only side-effect,
  • if a function modifies a table, then make that table the output of the function. Of course, you won't want to re-assign it: just run do.something.to(table) and not table <- do.something.to(table). If instead the function had another ("real") output, then when calling result <- do.something.to(table), it is easy to imagine how you may focus your attention on the output and forget that calling the function had a side effect on your table.
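As a rough sketch of what a procedure following both rules might look like (the name add_product and its col parameter are invented for this example, not part of data.table):

```r
library(data.table)

# Hypothetical procedure: modifies exactly one table (its argument)
# and returns that same table, invisibly, as its only output.
add_product <- function(dt, col = "y") {
  stopifnot(is.data.table(dt))
  dt[, (col) := V1 * V2]   # (col) means "use the value of col as the name"
  invisible(dt)
}

tbl <- CJ(V1 = 1:2, V2 = 2:3)

# No re-assignment needed: tbl itself gains the new column.
add_product(tbl)

# Because the table is the return value, calls can be chained.
add_product(tbl, "z")[, w := y + z]

names(tbl)
```

Returning the table with invisible() keeps the console quiet when the function is called only for its side effect, while still allowing chaining when that is convenient.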

While "one output / no-side-effect" functions are the norm in R, the above rules allow for "one output or side-effect". If you agree that a side-effect is somehow a form of output, then you'll agree I am not bending the rules too much by loosely sticking to R's one-output functional programming style. Allowing functions to have multiple side-effects would be a little more of a stretch; not that you can't do it, but I would try to avoid it if possible.
