Calculate sum of a column if the difference between consecutive rows meets a condition

Views: 18 | Posted: 2022/1/9 16:20:38 | Tags: r, group-by, sum


Problem description

Below is a sample dataset:

library(dplyr)

# Amount is stored as character here; convert it with as.numeric() before summing.
df <- data.frame(id = c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date = c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                          "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
                 Buyer = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                 Amount = c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = FALSE) %>%
  group_by(Buyer, id) %>%
  mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))

It looks like:

| id |    Date    | Buyer | diffs | Amount |
|----|:----------:|------:|-------|--------|
| 9  | 11/29/2018 |  John | NA    | 959    |
| 9  | 11/29/2018 |  John | 0     | 1158   |
| 9  | 11/29/2018 |  John | 0     | 596    |
| 5  | 2/13/2019  | Maria | NA    | 922    |
| 5  | 2/13/2019  | Maria | 0     | 922    |
| 4  | 6/15/2018  | Sandy | NA    | 1849   |
| 4  | 6/20/2018  | Sandy | 5     | 4193   |
| 4  | 8/17/2018  | Sandy | 58    | 4256   |
| 4  | 8/20/2018  | Sandy | 3     | 65     |
| 4  | 8/23/2018  | Sandy | 3     | 100    |
| 20 | 12/25/2018 | Paul  | NA    | 313    |
| 20 | 12/25/2018 | Paul  | 0     | 99     |

For each buyer and id, I need to keep the records where consecutive rows are at most 5 days apart and the amounts of those rows sum to more than 5000. For example, buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018', 5 days apart; since these two amounts sum to more than 5000, the output should contain these records. The same buyer 'Sandy' with id '4' has three more transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018', each 3 days apart, but the output should not contain them because their sum is below 5000. The final output would look like:
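
As a quick check of the arithmetic in that example, the two runs of Sandy's transactions can be summed directly. This is only a sketch; `first_run` and `second_run` are illustrative names, not part of the original post.

```r
# Transactions on 6/15/2018 and 6/20/2018 (5 days apart)
first_run <- c(1849, 4193)

# Transactions on 8/17/2018, 8/20/2018 and 8/23/2018 (3 days apart each)
second_run <- c(4256, 65, 100)

first_total  <- sum(first_run)   # 6042
second_total <- sum(second_run)  # 4421

print(first_total > 5000)   # TRUE: these rows are kept
print(second_total > 5000)  # FALSE: these rows are dropped
```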

| id |    Date   | Buyer | diffs | Amount |
|----|:---------:|------:|-------|--------|
| 4  | 6/15/2018 | Sandy | NA    | 1849   |
| 4  | 6/20/2018 | Sandy | 5     | 4193   |

Recommended answer

I would use a combination of techniques available in the tidyverse:

First, create a grouping variable (new_id) that marks runs of consecutive rows no more than 5 days apart, and sum Amount by the combination of the original id and new_id. Then filter for the groups whose total Amount exceeds 5000, and use a join or semi_join against that result to keep only the matching rows.
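
The run-numbering idea can be sketched in base R on Sandy's rows alone. This is an illustration under the assumption that the rows are already sorted by date; `gaps`, `amount`, `starts_run`, `run_id`, `run_total` and `keep` are hypothetical names, and `ave()` stands in here for the grouped sum that the answer computes with dplyr.

```r
# Day gaps and amounts for Sandy (id 4), in date order
gaps   <- c(NA, 5, 58, 3, 3)
amount <- c(1849, 4193, 4256, 65, 100)

# A row starts a new run if it has no predecessor or the gap exceeds 5 days
starts_run <- is.na(gaps) | gaps > 5

# The cumulative count of run starts labels each run
run_id <- cumsum(starts_run)  # 1 1 2 2 2

# Total amount of the run each row belongs to
run_total <- ave(amount, run_id, FUN = sum)  # 6042 6042 4421 4421 4421

# Keep only rows whose run total exceeds 5000
keep <- run_total > 5000
print(amount[keep])  # 1849 4193
```

The cumulative sum works because each TRUE in `starts_run` bumps the label by one, so every row inherits the label of the most recent run start.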

ids is a dataset that holds the total Amount for each id and new_id combination, filtered to the groups where dollar > 5000. This gives you the id and new_id pairs that meet your criteria.

library(tidyverse)

df <- data.frame(id = c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date = c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                          "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
                 Buyer = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                 Amount = c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = FALSE) %>%
  group_by(Buyer, id) %>%
  mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))

# tf1 flags the first row of each id; tf2 flags rows that start a new run
# (no previous row in the group, or more than 5 days since the previous row).
# id is character, so the lag() default must be a string as well.
df1 <- df %>% mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
                     tf1  = (id != lag(id, default = "")),
                     tf2  = (is.na(diffs) | diffs > 5))

# The running count of run starts gives each run its own new_id
df1$new_id <- cumsum(df1$tf1 | df1$tf2)

df1
#>    id    Date       Buyer Amount diffs tf1   tf2   new_id
#>    <chr> <date>     <chr>  <dbl> <dbl> <lgl> <lgl>  <int>
#>  1 9     2018-11-29 John     959    NA TRUE  TRUE       1
#>  2 9     2018-11-29 John    1158     0 FALSE FALSE      1
#>  3 9     2018-11-29 John     596     0 FALSE FALSE      1
#>  4 5     2019-02-13 Maria    922    NA TRUE  TRUE       2
#>  5 5     2019-02-13 Maria    922     0 FALSE FALSE      2
#>  6 4     2018-06-15 Sandy   1849    NA TRUE  TRUE       3
#>  7 4     2018-06-20 Sandy   4193     5 FALSE FALSE      3
#>  8 4     2018-08-17 Sandy   4256    58 FALSE TRUE       4
#>  9 4     2018-08-20 Sandy     65     3 FALSE FALSE      4
#> 10 4     2018-08-23 Sandy    100     3 FALSE FALSE      4
#> 11 20    2018-12-25 Paul     313    NA TRUE  TRUE       5
#> 12 20    2018-12-25 Paul      99     0 FALSE FALSE      5

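# Total Amount per (id, new_id) run, keeping only runs above 5000
ids <- df1 %>%
  group_by(id, new_id) %>%
  summarise(dollar = sum(Amount), .groups = "drop") %>%
  filter(dollar > 5000)

ids
#>   id    new_id dollar
#>   <chr>  <int>  <dbl>
#> 1 4          3   6042

# Keep only the rows of df1 whose (id, new_id) pair appears in ids
df1 %>% semi_join(ids, by = c("id", "new_id"))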
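
For comparison, the same filtering can probably be done in one dplyr pipeline by numbering runs inside each Buyer/id group and filtering on the grouped sum. This is a sketch of one possible variant, not the answer's own code; it assumes the rows are already sorted by date within each group, and `gap`, `run` and `result` are illustrative names.

```r
library(dplyr)

df <- data.frame(
  id     = c("9","9","9","5","5","4","4","4","4","4","20","20"),
  Date   = c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
             "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018",
             "12/25/2018","12/25/2018"),
  Buyer  = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy",
             "Sandy","Sandy","Paul","Paul"),
  Amount = c(959,1158,596,922,922,1849,4193,4256,65,100,313,99),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(Buyer, id) %>%
  # gap: days since the previous row in the group (NA for the first row)
  # run: increments whenever a row starts a new run of closely spaced rows
  mutate(gap = as.numeric(Date - lag(Date)),
         run = cumsum(is.na(gap) | gap > 5)) %>%
  group_by(Buyer, id, run) %>%
  # keep whole runs whose amounts sum to more than 5000
  filter(sum(Amount) > 5000) %>%
  ungroup()

print(result)
```

On the sample data this keeps Sandy's two rows from 6/15/2018 and 6/20/2018. Numbering runs per group avoids the separate tf1 flag, at the cost of run numbers that are only unique within a Buyer/id pair.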

