How to programmatically create binary columns based on a categorical variable in data.table?

Views: 183 Posted: 2017/3/12 10:54:08 Tags: r data.table binary-data programmatically-created


Problem description


I have a big (12 million rows) data.table which looks like this:

library(data.table)
set.seed(123)
dt <- data.table(id = rep(1:3, each = 5), y = sample(letters[1:5], 15, replace = TRUE))
> dt
    id y
 1:  1 b
 2:  1 d
 3:  1 c
 4:  1 e
 5:  1 e
 6:  2 a
 7:  2 c
 8:  2 e
 9:  2 c
10:  2 c
11:  3 e
12:  3 c
13:  3 d
14:  3 c
15:  3 a

I want to create a new data.table containing my variable id (which will be the unique key of this new data.table) and 5 other binary variables each one corresponding to each category of y which take value 1 if the id has that value for y, 0 otherwise.
The output data.table should look like this:

   id a b c d e
1:  1 0 1 1 1 1
2:  2 1 0 1 0 1
3:  3 1 0 1 1 1

I tried doing this in a loop but it's quite slow and also I don't know how to pass the binary variable names programmatically, as they depend on the variable I'm trying to "split".

EDIT: as @mtoto pointed out, a similar question has already been asked and answered here, but the solution is using the reshape2 package.
I was wondering if there's another (faster) way to do so by maybe using the := operator in data.table, as I have a massive dataset and I'm working quite a lot with this package.

EDIT2: benchmark of the functions in @Arun's post on my data (~12 million rows, ~3.5 million different ids and 490 different labels for the y variable, resulting in 490 dummy variables):

system.time(ans1 <- AnsFunction())   # 194s
system.time(ans2 <- dcastFunction()) # 55s
system.time(ans3 <- TableFunction()) # Takes forever and blocked my PC

Solution

data.table has its own dcast implementation using data.table's internals and should be fast. Give this a try:

dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill = 0L, value.var = "y")
#    id a b c d e
# 1:  1 0 1 1 1 1
# 2:  2 1 0 1 0 1
# 3:  3 1 0 1 1 1
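
For what it's worth, here is a small variant of the same dcast idea, not the call above: deduplicate the (id, y) pairs first and then count them with length, so every cell is 0 or 1 by construction. This is only a sketch. The toy data is typed out by hand (copied from the printed table in the question) so it does not depend on the random number generator, and the name wide is made up for the example.

```r
library(data.table)

# Toy data copied from the question, written out explicitly.
dt <- data.table(
  id = rep(1:3, each = 5),
  y  = c("b", "d", "c", "e", "e",
         "a", "c", "e", "c", "c",
         "e", "c", "d", "c", "a")
)

# unique() keeps one row per (id, y) pair, so length() can only return 1
# for pairs that exist; fill = 0L covers the pairs that never occur.
wide <- dcast(unique(dt), id ~ y,
              fun.aggregate = length, fill = 0L, value.var = "y")

print(wide)
```

On the toy data this should give the same 0/1 table as requested in the question. Whether deduplicating first helps or hurts on 12 million rows is something I have not measured.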

Just thought of another way to handle this by preallocating and updating by reference (perhaps dcast's logic should be done like this to avoid intermediates).

ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]

All that's left is to fill existing combinations with 1L.

dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
ans
#    id b d c e a
# 1:  1 1 1 1 1 0
# 2:  2 0 0 1 1 1
# 3:  3 0 1 1 1 1
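
A possible variation on the preallocate-and-update idea, in case the number of ids is much larger than the number of categories (as in the OP's data): loop over the categories instead of over the id groups, so set() is called once per column rather than once per id. This is a sketch only and I have not benchmarked it. The names cats, u and rows are invented for the example, and the toy data is typed out by hand from the question.

```r
library(data.table)

# Toy data copied from the question, written out explicitly.
dt <- data.table(
  id = rep(1:3, each = 5),
  y  = c("b", "d", "c", "e", "e",
         "a", "c", "e", "c", "c",
         "e", "c", "d", "c", "a")
)

cats <- sort(unique(dt$y))            # sorted so the columns come out as a..e

# Preallocate one row per id and one integer column of zeros per category.
ans <- data.table(id = unique(dt$id))
ans[, (cats) := 0L]

# One row per existing (id, y) pair, plus the matching row number in ans.
u    <- unique(dt[, .(id, y)])
rows <- match(u$id, ans$id)

# Update by reference: for each category, flip the rows that have it to 1L.
for (lv in cats) {
  set(ans, i = rows[u$y == lv], j = lv, value = 1L)
}

print(ans)
```

Because cats is sorted up front, the columns should already be in a, b, c, d, e order, so no setcolorder step is needed afterwards.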

Okay, I've gone ahead and benchmarked this on data of roughly the OP's dimensions, with ~10 million rows and 10 dummy columns.

require(data.table)
set.seed(45L)
y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))

system.time(ans1 <- AnsFunction())   # 2.3s
system.time(ans2 <- dcastFunction()) # 2.2s
system.time(ans3 <- TableFunction()) # 6.2s

setcolorder(ans1, names(ans2))
setcolorder(ans3, names(ans2))
setorder(ans1, id)
setkey(ans2, NULL)
setorder(ans3, id)

identical(ans1, ans2) # TRUE
identical(ans1, ans3) # TRUE
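
The reordering steps before the comparison matter because the two approaches return the same values in different shapes: the update-by-reference result has its columns in order of first appearance, while the dcast result has sorted columns and is keyed by id. A minimal sketch of that check on the toy data from the question follows. It uses all.equal with check.attributes = FALSE rather than identical, which is my own choice to keep the example insensitive to key or index attributes, not what the benchmark above uses.

```r
library(data.table)

# Toy data copied from the question, written out explicitly.
dt <- data.table(
  id = rep(1:3, each = 5),
  y  = c("b", "d", "c", "e", "e",
         "a", "c", "e", "c", "c",
         "e", "c", "d", "c", "a")
)

# Approach 1: preallocate zeros, then set 1L per id group by reference.
ans1 <- data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
dt[, {set(ans1, i = .GRP, j = unique(y), value = 1L); NULL}, by = id]

# Approach 2: data.table's dcast.
ans2 <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L,
              fill = 0L, value.var = "y")

# Bring both results into the same column and row order before comparing.
setcolorder(ans1, names(ans2))
setorder(ans1, id)
setkey(ans2, NULL)

same <- isTRUE(all.equal(ans1, ans2, check.attributes = FALSE))
print(same)
```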

where,

AnsFunction <- function() {
    ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
    dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
    ans
    # reorder columns outside
}

dcastFunction <- function() {
    # no need to load reshape2. data.table has its own dcast as well
    # no need for setDT
    dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill = 0L, value.var = "y")
}

TableFunction <- function() {
    # need to return integer results for identical results
    # fixed 1 -> 1L; as.numeric -> as.integer
    df <- as.data.frame.matrix(table(dt$id, dt$y))
    df[df > 1L] <- 1L
    df <- cbind(id = as.integer(row.names(df)), df)
    setDT(df)
}
