Creating multiple dummies from an existing data frame or data table

Question

I am looking for a quick extension to the following solution posted here. In it Frank shows that for an example data table

library(data.table)
test <- data.table("index"=rep(letters[1:10],100),"var1"=rnorm(1000,0,1))

You can quickly make dummies by using the following code:

inds <- unique(test$index)
test[,(inds):=lapply(inds,function(x)index==x)]
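
To make the effect of that code concrete, here is a minimal sketch on a tiny table. The name `small` and its six rows are made up for illustration and are not part of the linked solution.

```r
library(data.table)

# Hypothetical toy data: three distinct index values, two rows each
small <- data.table(index = rep(letters[1:3], 2), var1 = 1:6)

# One logical column per distinct value of `index`, added by reference
inds <- unique(small$index)
small[, (inds) := lapply(inds, function(x) index == x)]

names(small)   # index, var1, a, b, c
small$a        # TRUE only on the rows where index is "a"
```

Each new column is named after the value it flags, so this only works as written when the values are usable as column names and do not clash with existing columns.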

Now I want to extend this solution to a data.table that has multiple index columns, e.g.

new <- data.table(
  "id" = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"),125),
  "index1" = rep(letters[1:5],200),
  "index2" = rep(letters[6:15],100),
  "index3" = rep(letters[16:19],250))

I need to do this for many dummies and ideally the solution would allow me to get 4 things:



  1. The total count of every index
  2. The mean times every index occurs
  3. The count of every index per id
  4. The mean of every index per id

In my real case, the indices are named differently so the solution would need to be able to loop through the column names I think.

Thanks,

Simon

Recommended answer

If you only need the four items in that list, you should just tabulate:

indcols <- paste0('index',1:3)
lapply(new[,indcols,with=FALSE],table) # counts
lapply(new[,indcols,with=FALSE],function(x)prop.table(table(x))) # means

# or...

lapply(
  new[,indcols,with=FALSE],
  function(x){
    z<-table(x)
    rbind(count=z,mean=prop.table(z))
  })

This gives

$index1
          a     b     c     d     e
count 200.0 200.0 200.0 200.0 200.0
mean    0.2   0.2   0.2   0.2   0.2

$index2
          f     g     h     i     j     k     l     m     n     o
count 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
mean    0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1

$index3
           p      q      r      s
count 250.00 250.00 250.00 250.00
mean    0.25   0.25   0.25   0.25









The previous approach works on a data.frame or a data.table, but is rather complicated. With a data.table, one can use the melt syntax:

melt(new, id="id")[,.(
  N=.N, 
  mean=.N/nrow(new)
), by=.(variable,value)]

    variable value   N mean
 1:   index1     a 200 0.20
 2:   index1     b 200 0.20
 3:   index1     c 200 0.20
 4:   index1     d 200 0.20
 5:   index1     e 200 0.20
 6:   index2     f 100 0.10
 7:   index2     g 100 0.10
 8:   index2     h 100 0.10
 9:   index2     i 100 0.10
10:   index2     j 100 0.10
11:   index2     k 100 0.10
12:   index2     l 100 0.10
13:   index2     m 100 0.10
14:   index2     n 100 0.10
15:   index2     o 100 0.10
16:   index3     p 250 0.25
17:   index3     q 250 0.25
18:   index3     r 250 0.25
19:   index3     s 250 0.25

This approach was mentioned by @Arun in a comment (and implemented by him also, I think..?). To see how it works, first have a look at melt(new, id="id") which transforms the original data.table.
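
As a rough illustration of that intermediate step, the sketch below rebuilds the example table and inspects the melted result. The name `long` is introduced here only for illustration, and the argument is spelled out as `id.vars` rather than the abbreviated `id`.

```r
library(data.table)

new <- data.table(
  id     = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"), 125),
  index1 = rep(letters[1:5], 200),
  index2 = rep(letters[6:15], 100),
  index3 = rep(letters[16:19], 250))

# Long format: one row per (original row, index column) pair
long <- melt(new, id.vars = "id")

dim(long)     # 3000 rows (1000 rows x 3 index columns), 3 columns
names(long)   # id, variable, value
head(long)
```

Once the index columns are stacked this way, grouping by `variable` and `value` is what produces one count per index value.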

As mentioned in the comments, melting a data.table requires installing and loading reshape2 for some versions of the data.table package.
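
The question also asks for counts and means per id. One possible extension of the melt approach, not taken from the original answer, is to add `id` to the grouping. This sketch assumes that "mean per id" means the share of that id's rows carrying a given value; `long` and `per_id` are names made up here.

```r
library(data.table)

new <- data.table(
  id     = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"), 125),
  index1 = rep(letters[1:5], 200),
  index2 = rep(letters[6:15], 100),
  index3 = rep(letters[16:19], 250))

long <- melt(new, id.vars = "id")

# Count of every index value within every id
per_id <- long[, .(N = .N), by = .(id, variable, value)]

# Share of each id's rows (within one index column) taken by that value
per_id[, mean := N / sum(N), by = .(id, variable)]

per_id[id == "Jan" & variable == "index1"]
```

Combinations of id and value that never occur are simply absent from `per_id` rather than listed with a count of zero.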

If you also need the dummies, they can be made in a loop as in the linked question:

newcols <- list()
for (i in indcols){
    vals = unique(new[[i]])
    newcols[[i]] = paste(vals,i,sep='_')
    new[,(newcols[[i]]):=lapply(vals,function(x)get(i)==x)]
}
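
A quick way to sanity-check the loop is to count the TRUE values in each dummy and compare them with the tabulated counts. The sketch below repeats the loop on the example data so that it runs on its own; the `colSums` check is an addition for illustration, not part of the original answer.

```r
library(data.table)

new <- data.table(
  id     = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"), 125),
  index1 = rep(letters[1:5], 200),
  index2 = rep(letters[6:15], 100),
  index3 = rep(letters[16:19], 250))
indcols <- paste0("index", 1:3)

# Same loop as above: one logical column per value of each index column
newcols <- list()
for (i in indcols) {
  vals <- unique(new[[i]])
  newcols[[i]] <- paste(vals, i, sep = "_")
  new[, (newcols[[i]]) := lapply(vals, function(x) get(i) == x)]
}

newcols$index1                                 # a_index1 ... e_index1
colSums(new[, newcols$index1, with = FALSE])   # number of TRUEs per dummy
```

The sums should agree with `table(new$index1)`, since each dummy is TRUE exactly on the rows holding its value.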

This stores the groups of columns associated with each variable in newcols for convenience. If you wanted to do the tabulation just with these dummies (instead of the underlying variables as in solution above), you could do

lapply(
  indcols,
  function(i) new[,lapply(.SD,function(x){
    z <- sum(x)
    list(z,z/.N)
  }),.SDcols=newcols[[i]] ])

which gives a similar result. I just wrote it this way to illustrate how data.table syntax can be used. You could again avoid square brackets and .SD here:

lapply(
  indcols,
  function(i) sapply(
    new[, newcols[[i]], with=FALSE],
    function(x){
      z<-sum(x)
      rbind(z,z/length(x))
    }))



But anyway: just use table if you can hold onto the underlying variables.
