Fast crosstabs and stats on all pairs of variables
Question
I am trying to calculate a measure of association between all variables in a data.table. (This is not a stats question, but as an aside: the variables are all factors, and the measure is Cramér's V.)
Example dataset:
library(data.table)

p = 50; n = 1e5  # actual dataset has p > 1e3, n > 1e5: much wider but barely longer
set.seed(1234)
obs <- as.data.table(
  data.frame(
    cbind( matrix(sample(c(LETTERS[1:4],NA), n*(p/2), replace=TRUE),
                  nrow=n, ncol=p/2),
           matrix(sample(c(letters[1:6],NA), n*(p/2), replace=TRUE),
                  nrow=n, ncol=p/2) ),
    stringsAsFactors=TRUE ) )
I am currently using the split-apply-combine approach, which involves looping (via plyr::adply) through all pairs of indices and returning one row for each pair. (I attempted to parallelize adply but failed.)
# Calculate Cramér's V between all variables -- my kludgey approach
pairs <- t( combn(ncol(obs), 2) ) # nx2 matrix contains indices of upper triangle of df
# library('doParallel') # I tried to parallelize -- bonus points for help here (Win 7)
# cl <- makeCluster(8)
# registerDoParallel(cl)
library('plyr')
out <- adply(pairs, 1, function(ix) {
  complete_cases <- obs[, which(complete.cases(.SD)), .SDcols=ix]
  chsq <- chisq.test(x = dcast(data = obs[complete_cases, .SD, .SDcols=ix],
                               formula = paste( names(obs)[ix], collapse='~'),
                               value.var = names(obs)[ix][1], # arbitrary
                               fun.aggregate = length)[, -1, with=FALSE] )
  return(data.table(index_1 = ix[1],
                    var_1 = names(obs)[ix][1],
                    index_2 = ix[2],
                    var_2 = names(obs)[ix][2],
                    cramers_v = sqrt(chsq$statistic /
                                       (sum(chsq$observed) *
                                          (pmin(nrow(chsq$observed),
                                                ncol(chsq$observed) ) -1 ) )
                    ) )
  )
})[,-1] #}, .parallel = TRUE)[,-1] # using .parallel returns Error in do.ply(i) :
# task 1 failed - "object 'obs' not found"
out <- data.table(out) # adply won't return a data.table
# stopCluster(cl)
What are my options for speeding up this calculation? My challenge is in passing the row-wise operation on pairs into the column-wise calculations in obs. I am wondering if it is possible to generate the column pairs directly into J, but the Force is just not strong enough with this data.table padawan.
Recommended answer
First, I would use the 'long' data format, as follows:
obs[, id := 1:n]
mobs <- melt(obs, id.vars = 'id')
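To make the 'long' layout concrete, here is a minimal sketch on a made-up three-row table (the names toy and toy_long are hypothetical and not part of the answer). After melting, each row holds one (id, variable, value) triple, so every variable's levels end up in the same value column.

```r
library(data.table)

# Toy data (hypothetical): three observations of two factor columns.
toy <- data.table(X1 = factor(c("A", "B", "A")),
                  X2 = factor(c("a", "a", "b")))
toy[, id := seq_len(.N)]

# One row per (id, variable) pair; `variable` is kept as character here.
toy_long <- melt(toy, id.vars = "id", variable.factor = FALSE)
print(toy_long)
```

With 3 observations and 2 variables, toy_long has 6 rows; the real mobs would have n * p rows.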
Next, set the key on the data table with setkeyv(mobs, 'id').
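The key matters because the loop that follows joins the long table to a one-variable subset of itself on id. A small sketch of that self-join, using a hypothetical hand-built long table (toy_long and paired are illustrative names), might look like this:

```r
library(data.table)

# Hypothetical long table: 3 ids, 2 variables.
toy_long <- data.table(id       = rep(1:3, times = 2),
                       variable = rep(c("X1", "X2"), each = 3),
                       value    = c("A", "B", "A", "a", "a", "b"))
setkeyv(toy_long, "id")

# Join the full table to the rows of a single variable, matching on the key (id).
# Columns coming from the inner table get an "i." prefix.
paired <- toy_long[toy_long[variable == "X1"]]
print(paired)
```

Each row of paired now carries one observation's value for some variable alongside the same observation's value for X1 (in i.value), which is the raw material for a crosstab of every variable against X1.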
Finally, iterate through variables and do calculations on pairs:
out <- list()
for(i in 1:p) {
  vari <- paste0('X', i)
  # pair every variable with `vari` by joining on id
  tmp <- mobs[mobs[variable == vari]]
  # observed counts for each (variable, value, i.value) cell, dropping NAs
  nn <- tmp[!(is.na(value) | is.na(i.value)),
            list(i.variable = i.variable[1], nij = length(id)),
            keyby = list(variable, value, i.value)]
  # add the cells that have zero observations
  cj <- nn[, CJ(value = value, i.value = i.value, sorted = FALSE, unique = TRUE), by = variable]
  setkeyv(cj, c('variable', 'value', 'i.value'))
  nn <- nn[cj]
  nn[is.na(nij), nij := 0]
  # marginal totals, grand total, and table dimensions
  nn[, ni := sum(nij), by = list(variable, i.value)]
  nn[, nj := sum(nij), by = list(variable, value)]
  nn[, c('n', 'r', 'k') := list(sum(nij), length(unique(i.value)), length(unique(value))), by = variable]
  # Cramér's V = sqrt(chi^2 / (n * min(k - 1, r - 1))); the min term belongs inside the sqrt
  out[[i]] <- nn[, list(i.variable = vari,
                        cramers_v = sqrt(sum((nij - ni * nj / n) ^ 2 / (ni * nj / n)) /
                                         (n[1] * min(k[1] - 1, r[1] - 1)))),
                 by = variable]
}
out <- rbindlist(out)
So you only need to iterate through the variables once. As you can see, I also wouldn't use chisq.test and would write the computations myself.
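As a sanity check on the hand-written statistic, the sketch below computes Cramér's V for a single pair directly from a contingency table and compares it with the value derived from chisq.test. The helper cramers_v_manual and the simulated vectors x and y are hypothetical, introduced only for illustration; the sketch assumes every row and column of the table has at least one observation.

```r
# Hypothetical helper: Cramér's V from a contingency table, computed by hand.
cramers_v_manual <- function(tab) {
  n <- sum(tab)
  # expected counts under independence: (row total * column total) / n
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi2 <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}

set.seed(1)
x <- sample(LETTERS[1:4], 500, replace = TRUE)
y <- sample(letters[1:6], 500, replace = TRUE)
tab <- table(x, y)  # table() excludes NA values by default

v_manual <- cramers_v_manual(tab)

# Reference value built from chisq.test's statistic.
chsq  <- suppressWarnings(chisq.test(tab, correct = FALSE))
v_ref <- sqrt(unname(chsq$statistic) / (sum(tab) * (min(dim(tab)) - 1)))

stopifnot(isTRUE(all.equal(v_manual, v_ref)))
```

Running the same comparison on a few column pairs from obs is a cheap way to confirm that a vectorised data.table version agrees with the chisq.test-based approach from the question.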