Dummy variables from a string variable

Problem description

I would like to create dummy variables from this dataset:

DF<-structure(list(A = c(1, 2, 3, 4, 5), B = c("1,3,2", "2,1,3,6", 
  "3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6")), .Names = c("A", "B"), 
              row.names = c(NA, 5L), class = "data.frame")
> DF
  A                  B
1 1              1,3,2
2 2            2,1,3,6
3 3          3,2,5,1,7
4 4        3,7,4,2,6,5
5 5       4,10,7,3,5,6

The desired output should look like this:

A  1  2  3  4  5  6  7  8  9  10
1  1  1  1  0  0  0  0  0  0  0
2  1  1  1  0  0  1  0  0  0  0
3  1  1  1  0  1  0  1  0  0  0
4  0  1  1  1  1  1  1  0  0  0
5  0  0  1  1  1  1  1  0  0  1

Is there an efficient way to do this? I can use strsplit or ifelse. The original dataset is very large, with many rows (>10k) and many values in column B (>15k). The function dummy from the package dummies doesn't work the way I want.

I also found a similar case: "Splitting one column into multiple columns". But the answers to that question run very slowly in my case (up to 15 minutes on my Dell i7-2630QM, 8 GB RAM, Win7 64-bit, R 2.15.3 64-bit).

Thanks in advance for your answers.

Recommended answer

UPDATE

The function mentioned here has now been moved to a package available on CRAN called "splitstackshape". The version on CRAN is considerably faster than this original version. The speeds should be similar to what you would get with the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.

Install it, and use concat.split.expanded to get the desired result:

library(splitstackshape)
concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
#   A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
# 1 1    1    1    1    0    0    0    0    0    0    0
# 2 2    1    1    1    0    0    1    0    0    0    0
# 3 3    1    1    1    0    1    0    1    0    0    0
# 4 4    0    1    1    1    1    1    1    0    0    0
# 5 5    0    0    1    1    1    1    1    0    0    1

Original post

A while ago, I had written a function to do not just this sort of splitting, but others. The function, named concat.split(), can be found here.

The usage, for your example data, would be:

## Keeping the original column
concat.split(DF, "B", structure="expanded")
#   A            B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1        1,3,2   1   1   1  NA  NA  NA  NA  NA  NA   NA
# 2 2      2,1,3,6   1   1   1  NA  NA   1  NA  NA  NA   NA
# 3 3    3,2,5,1,7   1   1   1  NA   1  NA   1  NA  NA   NA
# 4 4  3,7,4,2,6,5  NA   1   1   1   1   1   1  NA  NA   NA
# 5 5 4,10,7,3,5,6  NA  NA   1   1   1   1   1  NA  NA    1

## Dropping the original column
concat.split(DF, "B", structure="expanded", drop.col=TRUE)
#   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1   1   1   1  NA  NA  NA  NA  NA  NA   NA
# 2 2   1   1   1  NA  NA   1  NA  NA  NA   NA
# 3 3   1   1   1  NA   1  NA   1  NA  NA   NA
# 4 4  NA   1   1   1   1   1   1  NA  NA   NA
# 5 5  NA  NA   1   1   1   1   1  NA  NA    1

Recoding NA to 0 has to be done manually--perhaps I'll update the function to add an option to do so, and at the same time, implement one of these faster solutions :)

temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
temp[is.na(temp)] <- 0
temp
#   A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1   1   1   1   0   0   0   0   0   0    0
# 2 2   1   1   1   0   0   1   0   0   0    0
# 3 3   1   1   1   0   1   0   1   0   0    0
# 4 4   0   1   1   1   1   1   1   0   0    0
# 5 5   0   0   1   1   1   1   1   0   0    1
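
The recode step works because is.na() applied to a data frame returns a logical matrix of the same shape, and a data frame accepts such a matrix as an assignment index. A minimal base R illustration follows; the small data frame is a hypothetical stand-in for the output of concat.split(), not a call to the real function:

```r
# Hypothetical stand-in for the expanded output of concat.split()
temp <- data.frame(A   = 1:3,
                   B_1 = c(1, NA, 1),
                   B_2 = c(NA, 1, NA))

na_mask <- is.na(temp)   # logical matrix with the same dimensions as temp
temp[na_mask] <- 0       # write 0 into every cell where the mask is TRUE
temp
```

Only the cells flagged by the mask are changed, so columns without missing values are left as they were.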

Update

Most of the overhead in the concat.split function probably comes in things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code used to do the splitting is a GASP for loop, but test it out, and you'll find that it performs pretty well:

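## Split each string into its comma-separated pieces
b = strsplit(DF$B, ",", fixed = TRUE)
## The largest value determines the number of dummy columns
ncol = max(as.numeric(unlist(b)))
temp = lapply(b, as.numeric)
## Set up an empty matrix
m = matrix(0, nrow = nrow(DF), ncol = ncol)
## Fill it in: the values in each row are used as column indices
for (i in seq_len(nrow(DF))) {
  m[i, temp[[i]]] = 1
}
## View your result
m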
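
If the loop itself ever becomes the bottleneck, one possible variation (not part of the original answer) is to fill the matrix in a single step using base R's two-column matrix indexing, then attach column A to reproduce the layout asked for in the question. This is a sketch under the assumption that every value in B is a positive integer; the names vals, rows, cols and result are introduced here for illustration only:

```r
# DF rebuilt from the question so that this block runs on its own
DF <- data.frame(A = 1:5,
                 B = c("1,3,2", "2,1,3,6", "3,2,5,1,7",
                       "3,7,4,2,6,5", "4,10,7,3,5,6"),
                 stringsAsFactors = FALSE)

# One integer vector of values per row
vals <- lapply(strsplit(DF$B, ",", fixed = TRUE), as.integer)

rows <- rep(seq_along(vals), lengths(vals))  # row index of every value
cols <- unlist(vals)                         # each value doubles as a column index

m <- matrix(0L, nrow = nrow(DF), ncol = max(cols))
m[cbind(rows, cols)] <- 1L                   # set all (row, column) cells at once
colnames(m) <- seq_len(ncol(m))

# Attach the identifier column; cbind() keeps the numeric column names
result <- cbind(DF["A"], m)
result
```

Wrapping either version in system.time() on your own data is the simplest way to see whether the change is worth it. Note that a dense matrix with more than 10k rows and 15k columns is large, so memory may matter more than the loop.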
