R: loop over columns in data.table

Views: 239 | Published: 2017/3/12 10:56:36 | Tags: r, data.table, sapply


Question

I want to determine the column classes of a large data.table.

colClasses <- sapply(DT, FUN=function(x)class(x)[1])

This works, but apparently copies are stored in memory:

> memory.size()
[1] 687.59
> colClasses <- sapply(DT, class)
> memory.size()
[1] 1346.21

A loop does not seem possible, because subsetting a data.table with with=FALSE always returns a data.table.

A quick and very dirty method is:

DT1 <- DT[1, ]
colClasses <- sapply(DT1, FUN=function(x)class(x)[1])

What is the most efficient way to do this?

Recommended answer

I have briefly investigated, and it looks like a data.table bug.

> DT = data.table(a=1:1e6,b=1:1e6,c=1:1e6,d=1:1e6)
> Rprofmem()
> sapply(DT,class)
        a         b         c         d 
"integer" "integer" "integer" "integer" 
> Rprofmem(NULL)
> noquote(readLines("Rprofmem.out"))
[1] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"       
[2] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply" 
[3] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"   
[4] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply" 

> tracemem(DT)
> sapply(DT,class)
tracemem[000000000431A290 -> 00000000065D70D8]: as.list.data.table as.list lapply sapply 
        a         b         c         d 
"integer" "integer" "integer" "integer"

So, looking at as.list.data.table:

> data.table:::as.list.data.table
function (x, ...) 
{
    ans <- unclass(x)
    setattr(ans, "row.names", NULL)
    setattr(ans, "sorted", NULL)
    setattr(ans, ".internal.selfref", NULL)
    ans
}
<environment: namespace:data.table>
>

Note the pesky unclass on the first line. ?unclass confirms that it takes a deep copy of its argument. From this quick look, it doesn't seem that sapply or lapply is doing the copying (I didn't think they would, since R is good at copy-on-write and neither of them writes), but rather the as.list call inside lapply, which dispatches to as.list.data.table.
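One way to check on your own installation whether as.list copies the column vectors is to compare memory addresses. The following is only a sketch: it assumes the data.table package is installed and uses its exported address() function, and the small table DT is made up for illustration. The result depends on the R and data.table versions in use, so no expected output is shown.

```r
library(data.table)

# Small illustrative table (hypothetical data, not from the question)
DT <- data.table(a = 1:5, b = 6:10)

# Address of column "a" as stored in the data.table
addr_original <- address(DT[["a"]])

# Address of the same column after going through as.list(),
# which dispatches to as.list.data.table
addr_via_list <- address(as.list(DT)[["a"]])

# TRUE means the column vector was shared, FALSE means it was copied
columns_shared <- identical(addr_original, addr_via_list)
print(columns_shared)
```

If columns_shared is FALSE, the as.list step duplicated the column, which is the behaviour described above.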

So, if we avoid the unclass, it should speed up. Let's try:

> DT = data.table(a=1:1e7,b=1:1e7,c=1:1e7,d=1:1e7)
> system.time(sapply(DT,class))
   user  system elapsed 
   0.28    0.06    0.35 
> system.time(sapply(DT,class))  # repeat timing a few times and take minimum
   user  system elapsed 
   0.17    0.00    0.17 
> system.time(sapply(DT,class))
   user  system elapsed 
   0.13    0.04    0.18 
> system.time(sapply(DT,class))
   user  system elapsed 
   0.14    0.03    0.17 
> assignInNamespace("as.list.data.table",function(x)x,"data.table")
> data.table:::as.list.data.table
function(x)x
> system.time(sapply(DT,class))
   user  system elapsed 
      0       0       0 
> system.time(sapply(DT,class))
   user  system elapsed 
   0.01    0.00    0.02 
> system.time(sapply(DT,class))
   user  system elapsed 
      0       0       0 
> sapply(DT,class)
        a         b         c         d 
"integer" "integer" "integer" "integer" 
>

So, yes: infinitely faster.

I've raised bug report #2000 to remove the as.list.data.table method, since a data.table is() already a list. This might actually speed up quite a few idioms, such as lapply(.SD, ...).
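Until such a fix is available, one possible workaround is to loop over the column names and extract each column with [[, so that as.list is never called on the whole table. This is a minimal sketch rather than part of the original answer; it assumes data.table is installed, and the example table DT is invented for illustration.

```r
library(data.table)

# Small illustrative table (hypothetical data)
DT <- data.table(a = 1:3, b = c("x", "y", "z"), c = c(1.5, 2.5, 3.5))

# Extract one column at a time with [[ and keep the first class of each.
# vapply() guarantees a character vector and names it after the columns.
col_classes <- vapply(
  names(DT),
  function(nm) class(DT[[nm]])[1L],
  character(1)
)

print(col_classes)
```

Taking class(x)[1L] mirrors the question's own code, which keeps only the first class for columns that carry several (for example POSIXct).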

Thanks for asking this question!
