data.table 按行求和,平均值,最小值,最大值,如 dplyr? [英] data.table row-wise sum, mean, min, max like dplyr?

查看:15
本文介绍了data.table 按行求和,平均值,最小值,最大值,如 dplyr?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

还有其他关于数据表上的按行运算符的帖子.它们要么 太简单 要么解决了一个 具体场景

There are other posts about row-wise operators on datatable. They are either too simple or solves a specific scenario

我的问题更笼统.有一个使用 dplyr 的解决方案.我玩过,但未能找到使用 data.table 语法的等效解决方案.您能否建议一个优雅的 data.table 解决方案,它可以重现与 dplyr 版本相同的结果?

My question here is more generic. There is a solution using dplyr. I have played around but failed to find a an equivalent solution using data.table syntax. Can you please suggest an elegant data.table solution that reproduce the same results than the dplyr version?

编辑 1:在真实数据集(10MB,73000 行,24 个数字列上进行的统计)的建议解决方案的基准总结.基准测试结果是主观的.但是,经过的时间始终可以重现.

EDIT 1: Summary of benchmarks of the suggested solutions on real dataset (10MB, 73000 rows, stats made on 24 numeric columns). The benchmark results is subjective. However, the elapsed time is consistently reproducible.

| Solution By | Speed compared to dplyr     |
|-------------|-----------------------------|
| Metrics v1  |  4.3 times SLOWER (use .SD) |
| Metrics v2  |  5.6 times FASTER           |
| ExperimenteR| 15   times FASTER           |
| Arun v1     |  3   times FASTER (Map func)|
| Arun v2     |  3   times FASTER (foo func)|
| Ista        |  4.5 times FASTER           |

编辑 2:我在一天后添加了 NACount 列.这就是为什么在各种贡献者建议的解决方案中找不到此专栏的原因.

EDIT 2: I have added NACount column a day after. This is why this column is not found in the solutions suggested by various contributors.

数据设置

library(data.table)
dt <- data.table(ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
    Country = c("CA", "FR", "FR", "CA", "CA"),
    Q1 = c(NA, 61, 40, 54, NA), Q2 = c(22,  8, NA,  5, NA),
    Q3 = c(51, NA, NA, 16, NA), Q4 = c(79, 10, 49, NA, NA))

#    ProductName Country Q1 Q2 Q3 Q4
# 1:     Lettuce      CA NA 22 51 79
# 2:    Beetroot      FR 61  8 NA 10
# 3:     Spinach      FR 40 NA NA 49
# 4:        Kale      CA 54  5 16 NA
# 5:      Carrot      CA NA NA NA NA

使用 dplyr + rowwise() 的解决方案

library(dplyr) ; library(magrittr)
dt %>% rowwise() %>% 
    transmute(ProductName, Country, Q1, Q2, Q3, Q4,
     AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))

#   ProductName Country Q1 Q2 Q3 Q4      AVG MIN  MAX SUM NAcnt
# 1     Lettuce      CA NA 22 51 79 50.66667  22   79 152     1
# 2    Beetroot      FR 61  8 NA 10 26.33333   8   61  79     1
# 3     Spinach      FR 40 NA NA 49 44.50000  40   49  89     2
# 4        Kale      CA 54  5 16 NA 25.00000   5   54  75     1
# 5      Carrot      CA NA NA NA NA      NaN Inf -Inf   0     4

data.table 出现错误(计算整列而不是每行)

dt[, .(ProductName, Country, Q1, Q2, Q3, Q4,
    AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))]

#    ProductName Country Q1 Q2 Q3 Q4      AVG MIN MAX SUM NAcnt
# 1:     Lettuce      CA NA 22 51 79 35.90909   5  79 395     9
# 2:    Beetroot      FR 61  8 NA 10 35.90909   5  79 395     9
# 3:     Spinach      FR 40 NA NA 49 35.90909   5  79 395     9
# 4:        Kale      CA 54  5 16 NA 35.90909   5  79 395     9
# 5:      Carrot      CA NA NA NA NA 35.90909   5  79 395     9

几乎解决方案,但更复杂且缺少 Q1、Q2、Q3、Q4 输出列

dtmelt <- reshape2::melt(dt, id=c("ProductName", "Country"),
            variable.name="Quarter", value.name="Qty")

dtmelt[, .(AVG = mean(Qty, na.rm=TRUE),
    MIN = min (Qty, na.rm=TRUE),
    MAX = max (Qty, na.rm=TRUE),
    SUM = sum (Qty, na.rm=TRUE),
    NAcnt= sum(is.na(Qty))), by = list(ProductName, Country)]

#    ProductName Country      AVG MIN  MAX SUM NAcnt
# 1:     Lettuce      CA 50.66667  22   79 152     1
# 2:    Beetroot      FR 26.33333   8   61  79     1
# 3:     Spinach      FR 44.50000  40   49  89     2
# 4:        Kale      CA 25.00000   5   54  75     1
# 5:      Carrot      CA      NaN Inf -Inf   0     4

推荐答案

您可以使用 matrixStats 包中的高效逐行函数.

You can use an efficient row-wise functions from matrixStats package.

library(matrixStats)
dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),
          MAX = rowMaxs(as.matrix(.SD), na.rm=T),
          AVG = rowMeans(.SD, na.rm=T),
          SUM = rowSums(.SD, na.rm=T)), .SDcols=c(Q1, Q2,Q3,Q4)]

dt
#    ProductName Country Q1 Q2 Q3 Q4 MIN  MAX      AVG SUM
# 1:     Lettuce      CA NA 22 51 79  22   79 50.66667 152
# 2:    Beetroot      FR 61  8 NA 10   8   61 26.33333  79
# 3:     Spinach      FR 40 NA 79 49  40   79 56.00000 168
# 4:        Kale      CA 54  5 16 NA   5   54 25.00000  75
# 5:      Carrot      CA NA NA NA NA Inf -Inf      NaN   0

对于 500000 行的数据集(使用来自 CRAN 的 data.table)

For dataset with 500000 rows(using the data.table from CRAN)

dt <- rbindlist(lapply(1:100000, function(i)dt))
system.time(dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),
                      MAX = rowMaxs(as.matrix(.SD), na.rm=T),
                      AVG = rowMeans(.SD, na.rm=T),
                      SUM = rowSums(.SD, na.rm=T)), .SDcols=c("Q1", "Q2","Q3","Q4")])
#  user  system elapsed 
# 0.089   0.004   0.093

rowwise(或by=1:nrow(dt))是for循环的委婉说法",如

rowwise (or by=1:nrow(dt)) is "euphemism" for for loop, as exemplified by

library(dplyr) ; library(magrittr)
system.time(dt %>% rowwise() %>% 
  transmute(ProductName, Country, Q1, Q2, Q3, Q4,
            MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE)))
#   user  system elapsed 
# 80.832   0.111  80.974 

system.time(dt[, `:=`(AVG= mean(as.numeric(.SD),na.rm=TRUE),MIN = min(.SD, na.rm=TRUE),MAX = max(.SD, na.rm=TRUE),SUM = sum(.SD, na.rm=TRUE)),.SDcols=c("Q1", "Q2","Q3","Q4"),by=1:nrow(dt)] )
#    user  system elapsed 
# 141.492   0.196 141.757

这篇关于data.table 按行求和,平均值,最小值,最大值,如 dplyr?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆