Fast crosstabs and stats on all pairs of variables

Question

I am trying to calculate a measure of association between all variables in a data.table. (This is not a stats question, but as an aside: the variables are all factors, and the measure is Cramér's V.)

Example dataset:

library(data.table)
p = 50; n = 1e5; # actual dataset has p > 1e3, n > 1e5, much wider but barely longer
set.seed(1234)
obs <- as.data.table( 
         data.frame(
           cbind( matrix(sample(c(LETTERS[1:4],NA), n*(p/2), replace=TRUE),
                         nrow=n, ncol=p/2),
                  matrix(sample(c(letters[1:6],NA), n*(p/2), replace=TRUE),
                         nrow=n, ncol=p/2) ),
         stringsAsFactors=TRUE ) )

I am currently using the split-apply-combine approach, which involves looping (via plyr::adply) through all pairs of indices and returning one row for each pair. (I attempted to parallelize adply but failed.)

# Calculate Cramér's V between all variables -- my kludgey approach

pairs <- t( combn(ncol(obs), 2) ) # nx2 matrix contains indices of upper triangle of df

# library('doParallel') # I tried to parallelize -- bonus points for help here (Win 7)
# cl <- makeCluster(8)
# registerDoParallel(cl)
library('plyr')
out <- adply(pairs, 1, function(ix) {
        complete_cases <- obs[,which(complete.cases(.SD)), .SDcols=ix]
        chsq <- chisq.test(x= dcast(data = obs[complete_cases, .SD, .SDcols=ix],
                                    formula = paste( names(obs)[ix], collapse='~'), 
                                    value.var = names(obs)[ix][1], # arbitrary
                                    fun.aggregate=length)[,-1, with=FALSE] )
        return(data.table(index_1 = ix[1],
                          var_1 =  names(obs)[ix][1],
                          index_2 = ix[2],
                          var_2 =  names(obs)[ix][2],
                          cramers_v = sqrt(chsq$statistic / 
                                             (sum(chsq$observed) *
                                                (pmin(nrow(chsq$observed),
                                                      ncol(chsq$observed) ) -1  ) )
                          ) ) 
        )
      })[,-1] #}, .parallel = TRUE)[,-1] # using .parallel returns Error in do.ply(i) : 
                                       # task 1 failed - "object 'obs' not found"
out <- data.table(out) # adply won't return a data.table   
# stopCluster(cl)

What are my options for speeding up this calculation? My challenge is in passing the row-wise operation on pairs into the column-wise calculations in obs. I am wondering if it is possible to generate the column pairs directly into J, but the Force is just not strong enough with this data.table padawan.

Recommended answer

First, I would use a 'long' data format, as follows:

obs[, id := 1:n]
mobs <- melt(obs, id.vars = 'id')
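
To make the reshaping concrete, here is a minimal sketch on a toy table. The table `wide` and its values are made up for illustration and are not the question's dataset; it only assumes the data.table package is installed.

```r
library(data.table)

# Toy wide table (hypothetical data): two categorical columns, one NA
wide <- data.table(X1 = c("A", "B", NA), X2 = c("a", "a", "b"))
wide[, id := seq_len(.N)]

# melt() stacks the columns: one row per (id, variable) pair,
# with the original column name in `variable` and its level in `value`
long <- melt(wide, id.vars = "id")
print(long)
```

Each observation now contributes one row per variable, so all variables can be processed in a single keyed table instead of column by column.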

Next, set a key on the long table with setkeyv(mobs, 'id').
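
The key matters because the loop in the next step joins the long table to itself on id. A small sketch of that self-join, using a made-up table `long` rather than the real `mobs`, might look like this:

```r
library(data.table)

# Toy long table (hypothetical data): 3 observations, 2 variables
long <- data.table(
  id       = rep(1:3, times = 2),
  variable = rep(c("X1", "X2"), each = 3),
  value    = c("A", "B", "A", "a", "b", "a")
)
setkeyv(long, "id")

# Join every row of `long` to the X1 row that shares its id.
# Columns of the inner table that clash with the outer one get an `i.` prefix.
paired <- long[long[variable == "X1"]]
print(paired)
```

Every row of `paired` holds one observation's level of some variable (`value`) next to the same observation's level of X1 (`i.value`), which is what a crosstab against X1 needs.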

Finally, iterate through the variables and do the calculations on pairs:

out <- list()
for (i in seq_len(p)) {
  vari <- paste0('X', i)
  # Self-join on id: pair each row with the same observation's value of vari
  tmp <- mobs[mobs[variable == vari]]
  # Observed cell counts nij per (variable, value, i.value), skipping NAs
  nn <- tmp[!(is.na(value) | is.na(i.value)), list(i.variable = i.variable[1], nij = length(id)), keyby = list(variable, value, i.value)]
  # Add level combinations that never occur, with a count of 0
  cj <- nn[, CJ(value = value, i.value = i.value, sorted = FALSE, unique = TRUE), by = variable]
  setkeyv(cj, c('variable', 'value', 'i.value'))
  nn <- nn[cj]
  nn[is.na(nij), nij := 0]
  # Marginal totals ni and nj, grand total n, and table dimensions r and k
  nn[, ni := sum(nij), by = list(variable, i.value)]
  nn[, nj := sum(nij), by = list(variable, value)]
  nn[, c('n', 'r', 'k') := list(sum(nij), length(unique(i.value)), length(unique(value))), by = variable]
  # Cramér's V = sqrt(chi^2 / (n * min(k - 1, r - 1)))
  out[[i]] <- nn[, list(i.variable = vari, cramers_v = sqrt(sum((nij - ni * nj / n) ^ 2 / (ni * nj / n)) / (n[1] * min(k[1] - 1, r[1] - 1)))), by = variable]
}
out <- rbindlist(out)

So you only need to iterate through the variables once. As you can see, I also wouldn't use chisq.test and would write the computations myself.
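
As a sanity check on writing the computation by hand, the sketch below compares a manual Cramér's V with one derived from chisq.test on a single small table. The helper name `cramers_v_manual` and the toy vectors are hypothetical and not part of the original answer; only base R is used. It takes V as sqrt(chi^2 / (n * (min(rows, cols) - 1))), the same definition the question's code uses.

```r
# Hypothetical helper: Cramér's V from a two-way contingency table
cramers_v_manual <- function(tab) {
  n <- sum(tab)
  # Expected counts under independence: (row total * column total) / n
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi2 <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}

# Toy data (made up): two independent categorical vectors
set.seed(1)
x <- sample(LETTERS[1:4], 1000, replace = TRUE)
y <- sample(letters[1:6], 1000, replace = TRUE)
tab <- table(x, y)

v_manual <- cramers_v_manual(tab)

# Reference value via chisq.test, using the formula from the question
chsq <- chisq.test(tab, correct = FALSE)
v_ref <- unname(sqrt(chsq$statistic / (sum(tab) * (min(dim(tab)) - 1))))

print(all.equal(v_manual, v_ref))
```

If the two values agree on a few such tables, the hand-written version can replace chisq.test inside the loop, where skipping the per-pair test object is the point.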
