Summarize (counts) by column efficiently


Question


I have a big table similar to datadf, with 3000 thousand columns and rows. I saw some methods on Stack Overflow to obtain my expected summary (the frequency of values per column in a table), but even the fastest is very slow for my table. Edit: thanks to the comments, several methods are now satisfactory.

library(data.table)

datadf <- data.frame(var1 = rep(c("T","A","G","C"), each = 3), var2 = rep(c("A","T","G","C"), each = 3), var3 = rep('-', 12), stringsAsFactors = F )
datadf <- datadf[sample(1:nrow(datadf), 1000, T),sample(1:ncol(datadf), 1000, T)]
dataDT <- as.data.table(datadf)
dataDT[1:7,1:20]
#    var2 var2.1 var3 var3.1 var1 var3.2 var2.2 var3.3 var2.3 var1.1 var3.4 var2.4 var1.2 var3.5 var2.5 var2.6 var1.3 var3.6 var2.7 var1.4
# 1:    T      T    -      -    A      -      T      -      T      A      -      T      A      -      T      T      A      -      T      A
# 2:    G      G    -      -    G      -      G      -      G      G      -      G      G      -      G      G      G      -      G      G
# 3:    T      T    -      -    A      -      T      -      T      A      -      T      A      -      T      T      A      -      T      A
# 4:    C      C    -      -    C      -      C      -      C      C      -      C      C      -      C      C      C      -      C      C
# 5:    C      C    -      -    C      -      C      -      C      C      -      C      C      -      C      C      C      -      C      C
# 6:    A      A    -      -    T      -      A      -      A      T      -      A      T      -      A      A      T      -      A      T
# 7:    T      T    -      -    A      -      T      -      T      A      -      T      A      -      T      T      A      -      T      A
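Before the benchmarks, it may help to see what the desired summary looks like for a single column. A minimal sketch using base `table()` on a tiny frame built the same way as `datadf` before the resampling step (the `small` name is illustrative, not from the original question):

```r
# Tiny frame mirroring the construction of datadf above (before resampling)
small <- data.frame(var1 = rep(c("T", "A", "G", "C"), each = 3),
                    var2 = rep(c("A", "T", "G", "C"), each = 3),
                    var3 = rep("-", 12),
                    stringsAsFactors = FALSE)

# Frequency of each value in one column
table(small$var1)
# A C G T
# 3 3 3 3
```

The expected summary is simply this per-column count, assembled for every column at once.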






Benchmark

library(tidyverse)
library(microbenchmark)
lev<-unique(unlist(datadf))

microbenchmark(
  # base R, edited based on comment by @fishtank
  sapply(datadf, function(x) table(factor(x, levels = lev, ordered = TRUE))),

  # tidyverse, edited based on comment by @AntoniosK
  datadf %>% gather() %>% count(key, value) %>% spread(key, n, fill = 0L),

  # data.table
  dcast(melt(dataDT, id = 1:1000, measure = 1:1000)[, 1001:1002][, `:=`(Count = .N), by = .(variable, value)],
        value ~ variable, value.var = "value", fun.aggregate = length),

  # answer from @bk18
  # myDcast <-
  dcast.data.table(
    melt.data.table(dataDT, measure.vars = colnames(dataDT))[, .N, .(variable, value)],
    value ~ variable,
    value.var = "N",
    fill = 0
  ),
  times = 1
)






Result

myDcast[,1:20]

#    value var2 var2.1 var3 var3.1 var1 var3.2 var2.2 var3.3 var2.3 var1.1 var3.4 var2.4 var1.2 var3.5 var2.5 var2.6 var1.3 var3.6 var2.7
# 1:     -    0      0 1000   1000    0   1000      0   1000      0      0   1000      0      0   1000      0      0      0   1000      0
# 2:     A  254    254    0      0  280      0    254      0    254    280      0    254    280      0    254    254    280      0    254
# 3:     C  230    230    0      0  230      0    230      0    230    230      0    230    230      0    230    230    230      0    230
# 4:     G  236    236    0      0  236      0    236      0    236    236      0    236    236      0    236    236    236      0    236
# 5:     T  280    280    0      0  254      0    280      0    280    254      0    280    254      0    280    280    254      0    280



                              min          lq        mean      median          uq         max neval
   sapply-table          89.77978    89.77978    89.77978    89.77978    89.77978    89.77978     1
   gather-count-spread  849.83078   849.83078   849.83078   849.83078   849.83078   849.83078     1
   dcast-melt         19938.71910 19938.71910 19938.71910 19938.71910 19938.71910 19938.71910     1
   data.table @bk18      46.92746    46.92746    46.92746    46.92746    46.92746    46.92746     1


Answer


This is about double the speed of the data.table method you provided, and should scale very well with the size of the dataset:

setDT(datadf)
dcast.data.table(
  melt.data.table(datadf, measure.vars = colnames(datadf))[, .N, .(variable, value)], 
  value ~ variable,
  value.var = "N",
  fill = 0
)
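To see why this pipeline is fast, here is the same melt → count → dcast sequence broken into steps on a tiny two-column frame (the `small`, `long`, and `counts` names are illustrative, not from the original answer):

```r
library(data.table)

# Tiny illustrative frame: one nucleotide column, one constant column
small <- data.table(var1 = rep(c("T", "A", "G", "C"), each = 3),
                    var3 = rep("-", 12))

# Step 1: reshape wide -> long; every cell becomes one (variable, value) row
long <- melt.data.table(small, measure.vars = colnames(small))

# Step 2: count occurrences of each value within each original column
counts <- long[, .N, .(variable, value)]

# Step 3: reshape back so each original column becomes a column of counts,
# with 0 filling in the value/column combinations that never occur
dcast.data.table(counts, value ~ variable, value.var = "N", fill = 0)
#    value var1 var3
# 1:     -    0   12
# 2:     A    3    0
# 3:     C    3    0
# 4:     G    3    0
# 5:     T    3    0
```

Counting on the long table with `.N` by group is a single grouped pass, which is why this scales better than calling `table()` per column.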


I'd be interested to see the benchmarks for your full dataset, because not all of these methods will scale similarly.
