How to extract the first n rows per group and calculate a function using that subset?

Question

My question is very similar to this one: How to extract the first n rows per group?

dt
         date age     name       val
1: 2000-01-01   3   Andrew  93.73546
2: 2000-01-01   4      Ben 101.83643
3: 2000-01-01   5  Charlie  91.64371
4: 2000-01-02   6     Adam 115.95281
5: 2000-01-02   7      Bob 103.29508
6: 2000-01-02   8 Campbell  91.79532

We have a data.table dt, to which I've added an extra column named val. First, we want to extract the first n rows within each group. The solutions from the linked question are:

dt[, .SD[1:2], by=date] # where 1:2 is the index needed
dt[dt[, .I[1:2], by = date]$V1] # for speed
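
One detail about both forms is worth keeping in mind: indexing with 1:2 assumes every group has at least two rows. The sketch below uses a hypothetical toy table (the names toy, g and v are not from the question) to show what happens with a shorter group, and two guarded variants that avoid the padding.

```r
library(data.table)

# Hypothetical toy data: group "b" has only one row
toy <- data.table(g = c("a", "a", "a", "b"), v = 1:4)

# .SD[1:2] pads the short group with an NA row
padded <- toy[, .SD[1:2], by = g]

# head() returns at most 2 rows per group, with no padding
trimmed <- toy[, head(.SD, 2), by = g]

# The row-index variant, guarded the same way
idx <- toy[, .I[seq_len(min(.N, 2L))], by = g]$V1
trimmed_fast <- toy[idx]
```

In the question's data every date has three rows, so the unguarded 1:2 form is fine there.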

My question is: how do I apply a function to the first n rows within each group when that function depends on the subsetted rows? I am trying to apply something like this:

# the real function uses other columns and depends on the subsetted rows,
# but keep it simple for replication
do_something <- function(dt) {
  res <- ifelse(cumsum(dt$val) > 200, 1, 0)
  return(res)
}
# first 2 rows of dt by group=date
x <- dt[, .SD[1:2], by=date]
# apply do_something to first 2 rows of dt by group=date
x[, list('age'=age,'name'=name,'val'=val, 'funcVal'= do_something(.SD[1:2])),by=date]

         date age   name       val funcVal
1: 2000-01-01   3 Andrew  93.73546       0
2: 2000-01-01   4    Ben 101.83643       0
3: 2000-01-02   6   Adam 115.95281       0
4: 2000-01-02   7    Bob 103.29508       1

Am I going about this the wrong way? Is there a more efficient way to do it? I cannot figure out how to adapt the "for speed" solution to this case. Is there a way to apply the function to the first 2 rows by date right away, without saving the subsetted result first?

Any help is appreciated. Below is the code to produce the data above:

library(data.table)

date <- c("2000-01-01","2000-01-01","2000-01-01",
          "2000-01-02","2000-01-02","2000-01-02")
age <- c(3,4,5,6,7,8)
name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell")
set.seed(1)  # this seed reproduces the val column printed above
val <- rnorm(6,100,10)
dt <- data.table(date, age, name, val)


Recommended answer

In case there's more than one grouping column, it might be more efficient to collapse them into a single group counter:

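# g = group counter, r = row numbers of the first 2 rows in each group
m = dt[, .(g = .GRP, r = .I[1:2]), by = date]
# apply ff (defined below) to those rows only, grouped by the stored counter
dt[m$r, v := ff(.SD), by=m$g, .SDcols="val"]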

This is just an extension of @eddi's approach (keeping the row numbers .I, as seen in @akrun's answer) that also keeps the group counter .GRP.

Regarding the OP's comment that they're more concerned about the function: well, borrowing from @akrun, there's

ff = function(x) as.integer(cumsum(x[[1]]) > 200)
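
Putting the pieces together, a minimal end-to-end sketch of this approach might look as follows. It assumes the data.table package is installed and hard-codes the val column printed in the question rather than regenerating it; res is a hypothetical name for the final subset.

```r
library(data.table)

# Data as printed in the question (val hard-coded instead of drawn with rnorm)
dt <- data.table(
  date = rep(c("2000-01-01", "2000-01-02"), each = 3),
  age  = c(3, 4, 5, 6, 7, 8),
  name = c("Andrew", "Ben", "Charlie", "Adam", "Bob", "Campbell"),
  val  = c(93.73546, 101.83643, 91.64371, 115.95281, 103.29508, 91.79532)
)

# .SD holds only the "val" column here, so x[[1]] is that column
ff <- function(x) as.integer(cumsum(x[[1]]) > 200)

# One row of m per selected row: group counter g and row number r
m <- dt[, .(g = .GRP, r = .I[1:2]), by = date]

# Update only the selected rows, grouped by the stored counter
dt[m$r, v := ff(.SD), by = m$g, .SDcols = "val"]

# Rows outside m$r keep NA in v; subset to see just the first 2 rows per date
res <- dt[m$r]
```

Because the update is done by reference with :=, no intermediate copy of the subset is stored, which is what the question asked for.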

Assuming all values are nonnegative, you could probably handle this more efficiently in C, since the cumulative sum can stop as soon as the threshold is reached. For the special case of two rows, that will hardly matter, though.
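
To make the early-exit idea concrete, here is a sketch of the logic in plain R. The helper ff_early and its threshold argument are hypothetical names, and an R-level loop like this will not beat the vectorized cumsum; it only illustrates what a C implementation could do. It assumes nonnegative values with no NAs.

```r
# Hypothetical helper: flag positions where the running total exceeds
# `threshold`, and stop accumulating at the first crossing.
# Valid only when all values are nonnegative and non-missing.
ff_early <- function(x, threshold = 200) {
  out <- integer(length(x))
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
    if (total > threshold) {
      # With nonnegative values the total cannot fall back below the
      # threshold, so every remaining position is flagged as well
      out[i:length(x)] <- 1L
      break
    }
  }
  out
}

vals <- c(115.95281, 103.29508, 91.79532)
early <- ff_early(vals)
full  <- as.integer(cumsum(vals) > 200)
```

On nonnegative input the two results agree; the loop simply skips the additions after the crossing.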

My impression is that this is a dummy function, so there's no point going there. Many of the efficiency improvements I usually think of are contingent on the function and the data.
