R: loop over columns in data.table
Question
I want to determine the column classes of a large data.table.
colClasses <- sapply(DT, FUN=function(x)class(x)[1])
This works, but apparently local copies of the columns are made in memory:
> memory.size()
[1] 687.59
> colClasses <- sapply(DT, class)
> memory.size()
[1] 1346.21
A loop does not seem possible, because subsetting a data.table with "with=FALSE" always returns a data.table.
A quick and very dirty method is:
DT1 <- DT[1, ]
colClasses <- sapply(DT1, FUN=function(x)class(x)[1])
What is the most elegant and efficient way to do this?
Recommended answer
I have investigated briefly, and it looks like a data.table bug.
> DT = data.table(a=1:1e6,b=1:1e6,c=1:1e6,d=1:1e6)
> Rprofmem()
> sapply(DT,class)
a b c d
"integer" "integer" "integer" "integer"
> Rprofmem(NULL)
> noquote(readLines("Rprofmem.out"))
[1] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"
[2] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"
[3] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"
[4] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"
> tracemem(DT)
> sapply(DT,class)
tracemem[000000000431A290 -> 00000000065D70D8]: as.list.data.table as.list lapply sapply
a b c d
"integer" "integer" "integer" "integer"
So, looking at as.list.data.table:
> data.table:::as.list.data.table
function (x, ...)
{
ans <- unclass(x)
setattr(ans, "row.names", NULL)
setattr(ans, "sorted", NULL)
setattr(ans, ".internal.selfref", NULL)
ans
}
<environment: namespace:data.table>
>
Note the pesky unclass on the first line. ?unclass confirms that it takes a deep copy of its argument. From this quick look, it doesn't seem that sapply or lapply is doing the copying (I didn't think they were, since R is good at copy-on-write and neither of them writes); rather, it is the as.list call inside lapply, which dispatches to as.list.data.table.
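The dispatch path described above can be reproduced without data.table. The following is only an illustrative sketch: the class name myframe and the counting method are hypothetical, and it relies on base R's lapply calling as.list on any classed object before iterating.

```r
# Hypothetical class "myframe": a plain list with a class attribute,
# standing in for a data.table in this illustration.
counter <- new.env()
counter$n <- 0L

# An as.list method that counts how often it is called and then
# strips the class, much like as.list.data.table does with unclass().
as_list_counting <- function(x, ...) {
  counter$n <- counter$n + 1L
  unclass(x)
}

# Register the method explicitly so dispatch does not depend on
# where this script is evaluated.
registerS3method("as.list", "myframe", as_list_counting)

obj <- structure(list(a = 1:3, b = letters[1:3]), class = "myframe")

# lapply() sees a classed object and calls as.list() on it first,
# so the method above runs once before any column is visited.
res <- lapply(obj, class)
```

After this runs, counter$n shows that the as.list method was invoked by lapply itself, which is the place where the answer locates the copy for data.table.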
So, if we avoid the unclass, it should speed up. Let's try:
> DT = data.table(a=1:1e7,b=1:1e7,c=1:1e7,d=1:1e7)
> system.time(sapply(DT,class))
user system elapsed
0.28 0.06 0.35
> system.time(sapply(DT,class)) # repeat timing a few times and take minimum
user system elapsed
0.17 0.00 0.17
> system.time(sapply(DT,class))
user system elapsed
0.13 0.04 0.18
> system.time(sapply(DT,class))
user system elapsed
0.14 0.03 0.17
> assignInNamespace("as.list.data.table",function(x)x,"data.table")
> data.table:::as.list.data.table
function(x)x
> system.time(sapply(DT,class))
user system elapsed
0 0 0
> system.time(sapply(DT,class))
user system elapsed
0.01 0.00 0.02
> system.time(sapply(DT,class))
user system elapsed
0 0 0
> sapply(DT,class)
a b c d
"integer" "integer" "integer" "integer"
>
So, yes, infinitely better.
I've raised bug report #2000 to remove the as.list.data.table method, since a data.table is() already a list, too. This might actually speed up quite a few idioms, such as lapply(.SD,...).
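For readers who would rather not patch the namespace, one possible workaround is to index the columns by position so that no as.list method is dispatched at all. This is a sketch under assumptions, not part of the bug report: the helper name col_classes is hypothetical, and the example uses a base data.frame so that it runs without data.table installed. Since a data.table is also a list, the same helper should apply to it, although that is not timed here.

```r
# Hypothetical helper: return the first class of every column without
# converting the whole table via as.list().
col_classes <- function(x) {
  out <- vapply(
    seq_along(x),
    # .subset2() is the method-free equivalent of x[[j]]; it extracts
    # one column at a time from the underlying list.
    function(j) class(.subset2(x, j))[1L],
    character(1)
  )
  names(out) <- names(x)
  out
}

df <- data.frame(
  a = 1:3,
  b = c(1.5, 2.5, 3.5),
  c = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

res <- col_classes(df)
```

Because vapply iterates over the integer positions rather than over the table, the table itself is never passed through as.list.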
Thanks for raising this issue!