How to programmatically create binary columns based on a categorical variable in data.table?


Problem description


I have a big (12 million rows) data.table which looks like this:

library(data.table)
set.seed(123)
dt <- data.table(id = rep(1:3, each = 5), y = sample(letters[1:5], 15, replace = TRUE))
> dt
    id y
 1:  1 b
 2:  1 d
 3:  1 c
 4:  1 e
 5:  1 e
 6:  2 a
 7:  2 c
 8:  2 e
 9:  2 c
10:  2 c
11:  3 e
12:  3 c
13:  3 d
14:  3 c
15:  3 a

I want to create a new data.table containing my variable id (which will be the unique key of this new data.table) and 5 binary variables, one for each category of y. Each binary variable takes the value 1 if the id has that value of y, and 0 otherwise.
The output data.table should look like this:

   id a b c d e
1:  1 0 1 1 1 1
2:  2 1 0 1 0 1
3:  3 1 0 1 1 1

I tried doing this in a loop, but it is quite slow, and I also don't know how to pass the binary variable names programmatically, as they depend on the variable I'm trying to "split".

EDIT: as @mtoto pointed out, a similar question has already been asked and answered here, but the solution is using the reshape2 package.
I was wondering if there's another (faster) way to do so by maybe using the := operator in data.table, as I have a massive dataset and I'm working quite a lot with this package.

EDIT2: benchmark of the functions in @Arun's post on my data (~12 million rows, ~3.5 million different ids and 490 different labels for the y variable, resulting in 490 dummy variables):

system.time(ans1 <- AnsFunction())   # 194s
system.time(ans2 <- dcastFunction()) # 55s
system.time(ans3 <- TableFunction()) # Takes forever and blocked my PC

Solution

data.table has its own dcast implementation using data.table's internals and should be fast. Give this a try:

dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L)
#    id a b c d e
# 1:  1 0 1 1 1 1
# 2:  2 1 0 1 0 1
# 3:  3 1 0 1 1 1
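
To make the role of fun.aggregate and fill more explicit, here is a small sketch of an equivalent formulation. It is not part of the original answer: the names pairs and wide are illustrative, and the toy data is typed out by hand (rather than drawn with sample()) so that the result does not depend on the random number generator. The idea is to drop repeated (id, y) pairs first, so that counting with length can only produce 1 for existing pairs, while fill supplies 0 for missing ones.

```r
library(data.table)

# Toy data copied from the question, written out explicitly.
dt <- data.table(
  id = rep(1:3, each = 5),
  y  = c("b", "d", "c", "e", "e",
         "a", "c", "e", "c", "c",
         "e", "c", "d", "c", "a")
)

# Keep one row per (id, y) pair, so every cell holds at most one value.
pairs <- unique(dt, by = c("id", "y"))

# length() counts the values per cell: 1 where the pair exists.
# Cells with no matching pair are filled with 0L.
wide <- dcast(pairs, id ~ y, value.var = "y",
              fun.aggregate = length, fill = 0L)
print(wide)
```

Whether this is faster or slower than the constant-function version on large data has not been measured here; it is only meant to show what the reshaping step computes.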


Just thought of another way to handle this by preallocating and updating by reference (perhaps dcast's logic should be done like this to avoid intermediates).

ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]

All that's left is to fill existing combinations with 1L.

dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
ans
#    id b d c e a
# 1:  1 1 1 1 1 0
# 2:  2 0 0 1 1 1
# 3:  3 0 1 1 1 1
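
Note that the columns above appear in order of first occurrence (b d c e a), and that i = .GRP works because the rows of ans were created in the same order in which by = id visits the groups. As a hedged variant of the same preallocate-and-update idea, the sketch below loops over the categories instead of the ids and looks up row positions with match(), so it does not rely on group order. The names ids, cats and rows are illustrative assumptions, and this variant has not been benchmarked against the functions in this post.

```r
library(data.table)

# Toy data copied from the question, written out explicitly.
dt <- data.table(
  id = rep(1:3, each = 5),
  y  = c("b", "d", "c", "e", "e",
         "a", "c", "e", "c", "c",
         "e", "c", "d", "c", "a")
)

ids  <- unique(dt$id)        # one output row per id
cats <- sort(unique(dt$y))   # one output column per category, sorted

# Preallocate the result: the id column plus one all-zero integer column
# per category, added by reference.
ans <- data.table(id = ids)
ans[, (cats) := 0L]

# For each category, find the rows of the ids that have it and set them to 1L.
for (cat_name in cats) {
  rows <- match(unique(dt$id[dt$y == cat_name]), ids)
  set(ans, i = rows, j = cat_name, value = 1L)
}
print(ans)
```

Because cats is sorted, the columns come out as a b c d e and need no reordering afterwards.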


Okay, I've gone ahead and benchmarked on OP's data dimensions with ~10 million rows and 10 columns.

require(data.table)
set.seed(45L)
y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))

system.time(ans1 <- AnsFunction())   # 2.3s
system.time(ans2 <- dcastFunction()) # 2.2s
system.time(ans3 <- TableFunction()) # 6.2s

setcolorder(ans1, names(ans2))
setcolorder(ans3, names(ans2))
setorder(ans1, id)
setkey(ans2, NULL)
setorder(ans3, id)

identical(ans1, ans2) # TRUE
identical(ans1, ans3) # TRUE
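
The setcolorder, setorder and setkey calls are needed because identical() compares attributes as well as values, so a difference in column order, row order or key makes two otherwise equal tables compare as different. A minimal sketch of the key effect, using made-up tables a and b that are not part of the benchmark:

```r
library(data.table)

a <- data.table(id = 1:3, x = c(0L, 1L, 1L))
b <- copy(a)

# Setting a key stores it as an attribute on the table
# (the rows are already sorted by id, so the data itself is unchanged).
setkey(b, id)
same_with_key <- identical(a, b)      # FALSE: b carries a key, a does not

# Removing the key makes the two tables identical again.
setkey(b, NULL)
same_without_key <- identical(a, b)   # TRUE

print(c(same_with_key, same_without_key))
```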

where,

AnsFunction <- function() {
    ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
    dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
    ans
    # reorder columns outside
}

dcastFunction <- function() {
    # no need to load reshape2. data.table has its own dcast as well
    # no need for setDT
    df <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill = 0L, value.var = "y")
}

TableFunction <- function() {
    # need to return integer results for identical results
    # fixed 1 -> 1L; as.numeric -> as.integer
    df <- as.data.frame.matrix(table(dt$id, dt$y))
    df[df > 1L] <- 1L
    df <- cbind(id = as.integer(row.names(df)), df)
    setDT(df)
}
