How to programmatically create binary columns based on a categorical variable in data.table?
I have a big (12 million rows) data.table which looks like this:
library(data.table)
set.seed(123)
dt <- data.table(id = rep(1:3, each = 5), y = sample(letters[1:5], 15, replace = TRUE))
> dt
id y
1: 1 b
2: 1 d
3: 1 c
4: 1 e
5: 1 e
6: 2 a
7: 2 c
8: 2 e
9: 2 c
10: 2 c
11: 3 e
12: 3 c
13: 3 d
14: 3 c
15: 3 a
I want to create a new data.table containing my variable id (which will be the unique key of this new data.table) and 5 other binary variables, one for each category of y, which take the value 1 if the id has that value for y and 0 otherwise.
The output data.table should look like this:
id a b c d e
1: 1 0 1 1 1 1
2: 2 1 0 1 0 1
3: 3 1 0 1 1 1
I tried doing this in a loop, but it's quite slow, and I also don't know how to pass the binary variable names programmatically, as they depend on the variable I'm trying to "split".
EDIT: as @mtoto pointed out, a similar question has already been asked and answered here, but that solution uses the reshape2 package.
I was wondering if there's another (faster) way to do this, perhaps using the := operator in data.table, as I have a massive dataset and I work with this package a lot.
EDIT2: benchmark of the functions in @Arun's post on my data (~12 million rows, ~3.5 million different ids and 490 different labels for the y variable, resulting in 490 dummy variables):
system.time(ans1 <- AnsFunction()) # 194s
system.time(ans2 <- dcastFunction()) # 55s
system.time(ans3 <- TableFunction()) # Takes forever and blocked my PC
data.table has its own dcast implementation, which uses data.table's internals and should be fast. Give this a try:
dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L)
# id a b c d e
# 1: 1 0 1 1 1 1
# 2: 2 1 0 1 0 1
# 3: 3 1 0 1 1 1
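As a hedged variation on the call above, the sketch below rebuilds the example table from the values printed in the question (so it does not depend on the random number generator) and passes value.var explicitly, which saves dcast from guessing the value column. Deduplicating first and counting with length is an alternative spelling swapped in here for comparison, not something taken from the answer; the names wide and wide2 are only illustrative.

```r
library(data.table)

# Example data typed in from the table printed in the question
dt <- data.table(
  id = rep(1:3, each = 5),
  y  = c("b", "d", "c", "e", "e",
         "a", "c", "e", "c", "c",
         "e", "c", "d", "c", "a")
)

# Same idea as above: every (id, y) pair that exists becomes 1L, the rest 0L
wide <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L,
              fill = 0L, value.var = "y")

# Alternative (assumption, not from the answer): drop duplicate pairs first,
# then count; each remaining pair occurs once, so the counts are 0 or 1
wide2 <- dcast(unique(dt), id ~ y, fun.aggregate = length, value.var = "y")

print(wide)
```

Both calls name the new columns after the values of y, so nothing has to be hard-coded when the set of categories changes.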
Just thought of another way to handle this by preallocating and updating by reference (perhaps dcast's logic should be done like this to avoid intermediates).
ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
All that's left is to fill existing combinations with 1L.
dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
ans
# id b d c e a
# 1: 1 1 1 1 1 0
# 2: 2 0 0 1 1 1
# 3: 3 0 1 1 1 1
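The columns come out in the order b, d, c, e, a because unique(dt$y) keeps values in order of first appearance. The rows line up for the same reason: .GRP counts groups in order of first appearance, which matches unique(dt$id). A minimal sketch that wraps the two steps and sorts the dummy columns afterwards, assuming y is a character column; the helper name dummies_by_ref is hypothetical, and the data is typed in from the table printed in the question:

```r
library(data.table)

# Hypothetical helper wrapping the two steps above; the name is illustrative
dummies_by_ref <- function(dt) {
  cols <- unique(dt$y)                    # dummy column names, first-seen order
  ans  <- data.table(id = unique(dt$id))  # one row per id, first-seen order
  ans[, (cols) := 0L]                     # preallocate every dummy as 0L
  # .GRP is the group counter; with by = id it runs in first-appearance
  # order, so group g lines up with row g of ans
  dt[, {set(ans, i = .GRP, j = unique(y), value = 1L); NULL}, by = id]
  setcolorder(ans, c("id", sort(cols)))   # optional: alphabetical dummy columns
  ans[]
}

dt <- data.table(
  id = rep(1:3, each = 5),
  y  = c("b", "d", "c", "e", "e",
         "a", "c", "e", "c", "c",
         "e", "c", "d", "c", "a")
)

res <- dummies_by_ref(dt)
print(res)
```

Reordering with setcolorder also works by reference, so it does not copy the table.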
Okay, I've gone ahead and benchmarked on data approximating the OP's dimensions, with ~10 million rows and 10 columns.
require(data.table)
set.seed(45L)
y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))
system.time(ans1 <- AnsFunction()) # 2.3s
system.time(ans2 <- dcastFunction()) # 2.2s
system.time(ans3 <- TableFunction()) # 6.2s
setcolorder(ans1, names(ans2))
setcolorder(ans3, names(ans2))
setorder(ans1, id)
setkey(ans2, NULL)
setorder(ans3, id)
identical(ans1, ans2) # TRUE
identical(ans1, ans3) # TRUE
where,
AnsFunction <- function() {
ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
ans
# reorder columns outside
}
dcastFunction <- function() {
# no need to load reshape2. data.table has its own dcast as well
# no need for setDT
df <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L,value.var = "y")
}
TableFunction <- function() {
# need to return integer results for identical results
# fixed 1 -> 1L; as.numeric -> as.integer
df <- as.data.frame.matrix(table(dt$id, dt$y))
df[df > 1L] <- 1L
df <- cbind(id = as.integer(row.names(df)), df)
setDT(df)
}
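One possible reason the same functions behave so differently on the OP's data is simply the size of the result. A back-of-the-envelope estimate, assuming 4-byte integer cells and the rounded dimensions quoted in EDIT2 (the variable names are illustrative):

```r
n_ids         <- 3.5e6  # approximate number of distinct ids (from EDIT2)
n_levels      <- 490    # distinct y labels, i.e. dummy columns (from EDIT2)
bytes_per_int <- 4      # R stores each integer in 4 bytes

# Size of the dense id-by-label result in GiB
result_gib <- n_ids * n_levels * bytes_per_int / 1024^3
print(round(result_gib, 1))
```

That comes to roughly 6.4 GiB for the finished table alone, before any intermediate copies, so an approach that first builds a dense table and then converts it may need a multiple of that.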