各组的最新值之和 [英] Sum of most recent values across groups

查看：64 发布时间：2020/10/15 19:44:22 r data.table

本文介绍了各组的最新值之和的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

对于我的数据的每一行，我想为每个组计算最新的值 ：

For each row of my data I'd like to compute the sum of most recent value for each group:

dt = data.table(group = c('a','b','a','a','b','a'),
                value = c(10, 5, 20, 15, 15, 10),
                desired = c(10, 15, 25, 20, 30, 25))
#   group value desired
#1:     a    10      10
#2:     b     5      15
#3:     a    20      25  # latest value of a is 20, of b is 5
#4:     a    15      20  # latest value of a is 15, of b is 5
#5:     b    15      30
#6:     a    10      25

期望列是我想要实现的，我可以做到幼稚的循环，但是我的数据很大，有很多行和组（1M +行，1000+组）。

desired column is what I want to achieve, and I can do this with a naive loop, but my data is quite large with a lot of rows and groups (1M+ rows, 1000+ groups).

for (i in seq_len(nrow(dt))) {
  # can use `set` to make this faster, but still too slow
  # this is just to illustrate *a* solution
  dt[i, desired1 := dt[1:i, value[.N], by = group][, sum(V1)]]
}

推荐答案

来自@eddi的更简单的逻辑（在注释中）减少了如下所示的回旋处：

Even simpler logic from @eddi (under comments) reducing the roundabout one shown below:

dt[, incr := diff(c(0, value)), by = group][, ans := cumsum(incr)]

不确定如何扩展到更多组，但是这里有3个组的示例数据：

Not sure how it extends to more groups, but here's on an example data with 3 groups:

# I hope I got the desired output correctly
require(data.table)
dt = data.table(group = c('a','b','c','a','a','b','c','a'),
                value = c(10, 5, 20, 25, 15, 15, 30, 10),
                desired = c(10, 15, 35, 50, 40, 50, 60, 55))

添加 rleid ：

dt[, id := rleid(group)]

提取每个<$的最后一行c $ c> group，id ：

last = dt[, .(value=value[.N]), by=.(group, id)]

last 将具有唯一的 id 。现在的想法是获取每个 id 的增量，然后重新加入并更新。

last will have unique id. Now the idea is to get the increment for each id, and then join+update back.

last = last[, incr := value - shift(value, type="lag", fill=0L), by=group
          ][, incr := cumsum(incr)-value][]

立即加入+更新：

dt[last, ans := value + i.incr, on="id"][, id := NULL][]
#    group value desired ans
# 1:     a    10      10  10
# 2:     b     5      15  15
# 3:     c    20      35  35
# 4:     a    25      50  50
# 5:     a    15      40  40
# 6:     b    15      50  50
# 7:     c    30      60  60
# 8:     a    10      55  55

我不确定在哪里/如果出现这种情况，现在将仔细检查。我立即写了它，以便引起更多关注。

I'm not yet sure where/if this breaks.. will look at it carefully now. I wrote it immediately so that there are more eyes on it.

使用David的解决方案比较500个具有10,000行的组：

Comparing on 500 groups with 10,000 rows with David's solution:

require(data.table)
set.seed(45L)
groups = apply(matrix(sample(letters, 500L*10L, TRUE), ncol=10L), 1L, paste, collapse="")
uniqueN(groups) # 500L
N = 1e4L
dt = data.table(group=sample(groups, N, TRUE), value = sample(100L, N, TRUE))

arun <- function(dt) {

    dt[, id := rleid(group)]
    last = dt[, .(value=value[.N]), by=.(group, id)]
    last = last[, incr := value - shift(value, type="lag", fill=0L), by=group
              ][, incr := cumsum(incr)-value][]
    dt[last, ans := value + i.incr, on="id"][, id := NULL][]
    dt$ans
}

david <- function(dt) {
    dt[, indx := .I]
    res <- dcast(dt, indx ~ group)
    for (j in names(res)[-1L]) 
        set(res, j = j, value = res[!is.na(res[[j]])][res, on = "indx", roll = TRUE][[j]])
    rowSums(as.matrix(res)[, -1], na.rm = TRUE)

}

system.time(ans1 <- arun(dt))  ## 0.024s
system.time(ans2 <- david(dt)) ## 38.97s 
identical(ans1, as.integer(ans2))
# [1] TRUE

这篇关于各组的最新值之和的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

各组的最新值之和 [英] Sum of most recent values across groups

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

各组的最新值之和 [英] Sum of most recent values across groups

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭