有效地汇总(计数)列 [英] Summarize (counts) by column efficiently
问题描述
我有一个类似于 datadf
的大表,具有3000万列和行,我看到了一些方法可以在堆栈溢出中获得期望的摘要(表中每列的频率),但即使最快也很慢为我的桌子。编辑:在评论中,目前有几种方法令人满意。
I have a big table similar to datadf
with 3000 thousand columns and rows, I saw some methods to obtain my expected summary in stack overflow (Frequency of values per column in table), but even the fastest is very slow for my table. thx to comments, several methods are currently satisfactory.
library(data.table)
datadf <- data.frame(var1 = rep(c("T","A","G","C"), each = 3), var2 = rep(c("A","T","G","C"), each = 3), var3 = rep('-', 12), stringsAsFactors = F )
datadf <- datadf[sample(1:nrow(datadf), 1000, T),sample(1:ncol(datadf), 1000, T)]
dataDT <- as.data.table(datadf)
dataDT[1:7,1:20]
# var2 var2.1 var3 var3.1 var1 var3.2 var2.2 var3.3 var2.3 var1.1 var3.4 var2.4 var1.2 var3.5 var2.5 var2.6 var1.3 var3.6 var2.7 var1.4
# 1: T T - - A - T - T A - T A - T T A - T A
# 2: G G - - G - G - G G - G G - G G G - G G
# 3: T T - - A - T - T A - T A - T T A - T A
# 4: C C - - C - C - C C - C C - C C C - C C
# 5: C C - - C - C - C C - C C - C C C - C C
# 6: A A - - T - A - A T - A T - A A T - A T
# 7: T T - - A - T - T A - T A - T T A - T A
基准
Benchmark
library(tidyverse)
library(microbenchmark)
lev<-unique(unlist(datadf))
microbenchmark(
#base EDITED based on comment @ fishtank
sapply(datadf, function(x) table(factor(x, levels=lev, ordered=TRUE))), #modified based on comment
#tidyverse EDITED based on comment @AntoniosK
datadf %>% gather() %>% count(key, value) %>% spread(key, n, fill = 0L), # based on comment
#data.table
dcast(melt(dataDT, id=1:1000, measure=1:1000)[,1001:1002][, `:=` (Count = .N), by=.(variable,value)], value ~ variable ,
value.var = "value", fun.aggregate = length),
# Answer from @bk18
# myDcast<-
dcast.data.table(
melt.data.table(dataDT, measure.vars = colnames(dataDT))[, .N, .(variable, value)],
value ~ variable,
value.var = "N",
fill = 0
),
times=1
)
结果
Result
myDcast[,1:20]
# value var2 var2.1 var3 var3.1 var1 var3.2 var2.2 var3.3 var2.3 var1.1 var3.4 var2.4 var1.2 var3.5 var2.5 var2.6 var1.3 var3.6 var2.7
# 1: - 0 0 1000 1000 0 1000 0 1000 0 0 1000 0 0 1000 0 0 0 1000 0
# 2: A 254 254 0 0 280 0 254 0 254 280 0 254 280 0 254 254 280 0 254
# 3: C 230 230 0 0 230 0 230 0 230 230 0 230 230 0 230 230 230 0 230
# 4: G 236 236 0 0 236 0 236 0 236 236 0 236 236 0 236 236 236 0 236
# 5: T 280 280 0 0 254 0 280 0 280 254 0 280 254 0 280 280 254 0 280
min lq mean median uq max neval
sapply-table 89.77978 89.77978 89.77978 89.77978 89.77978 89.77978 1
gather-count-spread 849.83078 849.83078 849.83078 849.83078 849.83078 849.83078 1
dcast-melt 19938.71910 19938.71910 19938.71910 19938.71910 19938.71910 19938.71910 1
data.table @bk18 46.92746 46.92746 46.92746 46.92746 46.92746 46.92746 1
推荐答案
这大约是您提供的data.table方法速度的两倍,并且应该与数据集的大小很好地缩放:
This is about double the speed of the data.table method you provided, and should scale very well with the size of the dataset:
setDT(datadf)
dcast.data.table(
melt.data.table(datadf, measure.vars = colnames(datadf))[, .N, .(variable, value)],
value ~ variable,
value.var = "N",
fill = 0
)
我很想查看整个数据集的基准,因为并非所有这些方法的缩放比例都相似。
I'd be interested to see the benchmarks for your full dataset, because not all of these methods will scale similarly.
这篇关于有效地汇总(计数)列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!