Creating multiple dummies from an existing data frame or data table
Question
I am looking for a quick extension to the following solution posted here. In it Frank shows that for an example data table
test <- data.table("index"=rep(letters[1:10],100),"var1"=rnorm(1000,0,1))
You can quickly make dummies by using the following code
inds <- unique(test$index) ; test[,(inds):=lapply(inds,function(x)index==x)]
Now I want to extend this solution to a data.table that has multiple index columns, e.g.
new <- data.table("id" = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"),125), "index1"=rep(letters[1:5],200),"index2" = rep(letters[6:15],100),"index3" = rep(letters[16:19],250))
I need to do this for many dummies and ideally the solution would allow me to get 4 things:
- The total count of every index
- The mean times every index occurs
- The count of every index per id
- The mean of every index per id
In my real case, the indices are named differently so the solution would need to be able to loop through the column names I think.
Thanks,
Simon
Recommended answer
If you only need the four items in that list, you should just tabulate:
indcols <- paste0('index',1:3)
lapply(new[,indcols,with=FALSE],table) # counts
lapply(new[,indcols,with=FALSE],function(x)prop.table(table(x))) # means
# or...
lapply(
  new[,indcols,with=FALSE],
  function(x){
    z <- table(x)
    rbind(count=z, mean=prop.table(z))
  })
This gives
$index1
a b c d e
count 200.0 200.0 200.0 200.0 200.0
mean 0.2 0.2 0.2 0.2 0.2
$index2
f g h i j k l m n o
count 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
mean 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
$index3
p q r s
count 250.00 250.00 250.00 250.00
mean 0.25 0.25 0.25 0.25
The previous approach works on a data.frame or a data.table, but is rather complicated. With a data.table, one can use the melt syntax:
melt(new, id="id")[,.(
  N=.N,
  mean=.N/nrow(new)
), by=.(variable,value)]
variable value N mean
1: index1 a 200 0.20
2: index1 b 200 0.20
3: index1 c 200 0.20
4: index1 d 200 0.20
5: index1 e 200 0.20
6: index2 f 100 0.10
7: index2 g 100 0.10
8: index2 h 100 0.10
9: index2 i 100 0.10
10: index2 j 100 0.10
11: index2 k 100 0.10
12: index2 l 100 0.10
13: index2 m 100 0.10
14: index2 n 100 0.10
15: index2 o 100 0.10
16: index3 p 250 0.25
17: index3 q 250 0.25
18: index3 r 250 0.25
19: index3 s 250 0.25
This approach was mentioned by @Arun in a comment (and implemented by him also, I think..?). To see how it works, first have a look at melt(new, id="id"), which transforms the original data.table.
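To make the reshaping step concrete, here is a minimal sketch on a tiny stand-in table. The table small and its values are made up for illustration, and id.vars is spelled out in full where the answer abbreviates it to id.

```r
library(data.table)

# Tiny stand-in for `new`; the values are hypothetical.
small <- data.table(
  id     = c("Jan", "James"),
  index1 = c("a", "b"),
  index2 = c("f", "g")
)

# Wide to long: one row per (id, index column) pair.
long <- melt(small, id.vars = "id")

# `variable` holds the former column name, `value` the former cell content.
print(long)
dim(long)  # 4 rows, 3 columns
```

Once the data is in this long shape, grouping by variable and value counts every level of every index column in one pass.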
As mentioned in the comments, melting a data.table requires installing and loading reshape2 for some versions of the data.table package.
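The question also asks for counts and means per id, which the tabulations above do not break out. A minimal sketch of one way to get them with the same melt approach follows. It assumes that "mean per id" means the share of that id's rows in which a value occurs; the names long and per_id are introduced here for illustration.

```r
library(data.table)

# Example data from the question.
new <- data.table(
  id     = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"), 125),
  index1 = rep(letters[1:5], 200),
  index2 = rep(letters[6:15], 100),
  index3 = rep(letters[16:19], 250)
)

# Long format: one row per (id, index column) pair.
long <- melt(new, id.vars = "id")

# Count of every index value per id.
per_id <- long[, .(N = .N), by = .(id, variable, value)]

# Assumed definition of the per-id mean: count divided by that id's
# number of rows, computed within each (id, index column) group.
per_id[, mean := N / sum(N), by = .(id, variable)]
```

Combinations of id and value that never occur are simply absent from per_id rather than listed with a count of zero.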
If you also need the dummies, they can be made in a loop as in the linked question:
newcols <- list()
for (i in indcols){
  vals = unique(new[[i]])
  newcols[[i]] = paste(vals,i,sep='_')
  new[,(newcols[[i]]):=lapply(vals,function(x)get(i)==x)]
}
This stores the groups of columns associated with each variable in newcols for convenience. If you wanted to do the tabulation just with these dummies (instead of the underlying variables as in the solution above), you could do
lapply(
  indcols,
  function(i) new[,lapply(.SD,function(x){
    z <- sum(x)
    list(z,z/.N)
  }),.SDcols=newcols[[i]] ])
which gives a similar result. I just wrote it this way to illustrate how data.table syntax can be used. You could again avoid square brackets and .SD here:
lapply(
  indcols,
  function(i) sapply(
    new[, newcols[[i]], with=FALSE],
    function(x){
      z <- sum(x)
      rbind(z,z/length(x))
    }))
But anyway: just use table if you can hold onto the underlying variables.
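As a small check of that advice, the sketch below builds dummies for one column of a made-up table and compares their column sums with table() on the underlying variable. The table dt and the names dummy_cols, from_dummies and from_table are hypothetical and used only for this illustration.

```r
library(data.table)

# Small stand-in for `new`; the values are hypothetical.
dt <- data.table(
  id     = rep(c("Jan", "James"), 3),
  index1 = c("a", "b", "a", "c", "b", "a")
)

# Build the dummies for index1 the same way as in the loop above.
vals <- unique(dt[["index1"]])
dummy_cols <- paste(vals, "index1", sep = "_")
dt[, (dummy_cols) := lapply(vals, function(x) index1 == x)]

# Counts recovered from the dummies ...
from_dummies <- sapply(dt[, dummy_cols, with = FALSE], sum)

# ... and the same counts straight from the underlying variable.
from_table <- table(dt$index1)

print(from_dummies)
print(from_table)
```

Both routes report the same counts per level, so the dummy columns add nothing when only counts and proportions are needed.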