How do I self join a data.table in a manner like dcast
Problem description
Suppose I have a data.table in "melted" form, where I have a key, an identifier and a value:
library(data.table)
library(reshape2)
DT = data.table(X = c(1:5, 1:4), Y = c(rep("A", 5), rep("B", 4)), Z = rnorm(9))
DT2 = data.table(dcast(DT, X~Y))
How can I perform that sort of self join inside data.table?
> DT
X Y Z
1: 1 A -0.19790449
2: 2 A 0.17906116
3: 3 A 0.01821837
4: 4 A 0.17309716
5: 5 A 0.05962474
6: 1 B -0.24629468
7: 2 B 0.92285734
8: 3 B 0.66002573
9: 4 B -1.01403880
> DT2
X A B
1: 1 -0.19790449 -0.2462947
2: 2 0.17906116 0.9228573
3: 3 0.01821837 0.6600257
4: 4 0.17309716 -1.0140388
5: 5 0.05962474 NA
Aside (mostly for Arun): Here is a solution I already use for melt. It was written with help from Matthew D (so he should have this code); I think it replicates melt completely and is pretty efficient. Dcast, on the other hand (or should that be dtcast?), is much harder!
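melt.data.table = function(data, id.vars, measure.vars,
                           variable.name = "variable",
                           ..., na.rm = FALSE, value.name = "value") {
  # Default each argument to "every column not named in the other one"
  if (missing(id.vars)) {
    id.vars = setdiff(names(data), measure.vars)
  }
  if (missing(measure.vars)) {
    measure.vars = setdiff(names(data), id.vars)
  }
  # One sub-table per measure column: the id columns, that column,
  # and a label column holding the measure column's name
  dtlist = lapply(measure.vars, function(..colname) {
    data[, c(id.vars, ..colname), with = FALSE][, (variable.name) := ..colname]
  })
  # rbindlist stacks by position, so the value column keeps the name
  # of the first measure column until it is renamed here
  dt = rbindlist(dtlist)
  setnames(dt, measure.vars[1], value.name)
  if (na.rm) {
    return(na.omit(dt))
  } else {
    return(dt)
  }
}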
Update: faster versions of melt and dcast are now implemented (in C) in data.table versions >= 1.9.0. Check this post for more info.
Now you can just do:
dcast.data.table(DT, X~Y)
In the case of dcast alone, at the moment, it has to be written out completely (as it's not an S3 generic yet in reshape2). We'll try to fix this as soon as possible. For melt, you can just use melt(.) as you'd do normally.
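As a rough illustration of the cast and melt round trip described above, here is a minimal sketch. It assumes a more recent data.table release than the one this answer was written against, where (as far as I know) plain dcast() dispatches on a data.table without reshape2 loaded. The fixed Z values are made up for reproducibility and stand in for the rnorm(9) draw in the question.

```r
library(data.table)

# Deterministic stand-in for the question's DT (made-up Z values, not rnorm)
DT <- data.table(
  X = c(1:5, 1:4),
  Y = c(rep("A", 5), rep("B", 4)),
  Z = c(0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4)
)

# Long -> wide: one row per X, one column per distinct Y, filled from Z.
# Assumption: the installed data.table provides its own dcast() generic.
wide <- dcast(DT, X ~ Y, value.var = "Z")

# Wide -> long again; na.rm = TRUE drops the missing (X = 5, Y = "B") cell
long <- melt(wide, id.vars = "X", variable.name = "Y",
             value.name = "Z", na.rm = TRUE)
```

Naming value.var explicitly avoids the guess dcast would otherwise make about which column holds the values.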
The general idea is this:
setkey(DT, X, Y)
DT[CJ(1:5, c("A", "B"))][, as.list(Z), by=X]
You can rename the columns V1 and V2 to A and B using setnames.
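Put together, the join-based approach might look like the sketch below. The fixed Z values are made up so the result is reproducible, and the intermediate names full and wide are only for illustration. Note that it relies on each (X, Y) pair appearing at most once; with duplicates, as.list(Z) would return a different number of elements per group.

```r
library(data.table)

# Deterministic stand-in for the question's DT (made-up Z values, not rnorm)
DT <- data.table(
  X = c(1:5, 1:4),
  Y = c(rep("A", 5), rep("B", 4)),
  Z = c(0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4)
)

setkey(DT, X, Y)

# Join against every (X, Y) combination so missing pairs show up with Z = NA
full <- DT[CJ(1:5, c("A", "B"))]

# Within each X the rows are ordered A then B, so as.list(Z) yields V1, V2
wide <- full[, as.list(Z), by = X]

# Give the generated columns the names of the Y levels
setnames(wide, c("V1", "V2"), c("A", "B"))
```

The CJ() call is what guarantees every X group has the same length, which is what lets as.list(Z) produce a rectangular result.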
But this may not be efficient on large data or when the cast formula is complex. Or rather, I should say, it could be much more efficient. We're in the process of finding such an implementation to integrate melt and cast into data.table. Until then, you can work around this as shown above.
I'll update this post once we've made significant progress with melt/cast.