R: create dummy variables based on a categorical variable *of lists*

Question

I have a data frame with a categorical variable holding lists of strings, with variable length (it is important because otherwise this question would be a duplicate of this or this), e.g.:

df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
df

  x       y
1 1       A
2 2    A, B
3 3       C
4 4 B, D, C
5 5       E

And the desired form is a dummy variable for each unique string seen anywhere in df$y, i.e.:

data.frame(x = 1:5, A = c(1,1,0,0,0), B = c(0,1,0,1,0), C = c(0,0,1,1,0), D = c(0,0,0,1,0), E = c(0,0,0,0,1))

  x A B C D E
1 1 1 0 0 0 0
2 2 1 1 0 0 0
3 3 0 0 1 0 0
4 4 0 1 1 1 0
5 5 0 0 0 0 1

This naive approach works:

# Collect every distinct string that appears anywhere in df$y
uniqueStrings <- unique(unlist(df$y))
n <- ncol(df)
# Append one 0/1 column per unique string
for (i in seq_along(uniqueStrings)) {
  df[, n + i] <- sapply(df$y, function(x) ifelse(uniqueStrings[i] %in% x, 1, 0))
  colnames(df)[n + i] <- uniqueStrings[i]
}

However, it is very ugly, lazy, and slow with big data frames.

Any suggestions? Something fancy from the tidyverse?

UPDATE: I got 3 different approaches below. I tested them using system.time on my laptop (Windows 7, 32GB RAM) on a real dataset comprising 1M rows, each row containing a list of 1 to 4 strings (out of ~350 unique string values), 200MB on disk overall. So the expected result is a data frame with dimensions 1M x 350. The tidyverse (@Sotos) and base (@joel.wilson) approaches took so long that I had to restart R. The qdapTools (@akrun) approach, however, worked fantastically:

> system.time(res1 <- mtabulate(varsLists))
   user  system elapsed 
  47.05   10.27  116.82

So this is the approach I'll mark accepted.
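
For a result of this shape (1M x 350, mostly zeros), a sparse matrix may be worth considering instead of a dense data frame. The following is only a minimal sketch. It is not one of the three approaches tested above and has not been timed on the real dataset. It assumes the Matrix package is installed, runs on the toy df from the question, and the names row_id, value, levs and dummies are illustrative.

```r
library(Matrix)  # assumption: the Matrix package is installed

# Toy data from the question
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

# One (row, string) pair per list element
row_id <- rep(seq_along(df$y), lengths(df$y))
value  <- unlist(df$y)
levs   <- sort(unique(value))

# Sparse indicator matrix: rows are observations, columns are unique strings.
# Repeated (row, string) pairs are summed, so entries are counts, not strict 0/1.
dummies <- sparseMatrix(
  i = row_id,
  j = match(value, levs),
  x = 1,
  dims = c(nrow(df), length(levs)),
  dimnames = list(NULL, levs)
)
```

On small data, as.matrix(dummies) converts the result to an ordinary dense matrix for inspection.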

Recommended answer

We can use mtabulate:

library(qdapTools)
cbind(df[1], mtabulate(df$y))
#  x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1
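
If adding a dependency is not an option, much the same table can be sketched in base R by flattening the list column into (row, string) pairs and cross-tabulating them. This is not part of the answer above and has not been timed on the 1M-row dataset. The names row_id, value, counts and res are illustrative.

```r
# Toy data from the question
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

# One (row, string) pair per list element
row_id <- rep(seq_along(df$y), lengths(df$y))
value  <- unlist(df$y)

# Cross-tabulate; the explicit factor levels keep rows whose list is empty
counts <- table(factor(row_id, levels = seq_len(nrow(df))), value)

# Entries are counts; use (counts > 0) + 0L first if strict 0/1 is required
res <- cbind(df["x"], as.data.frame.matrix(counts))
res
```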
