Group by consecutive values in R

Published: 2020/10/26 4:37:55 | Tags: r, dplyr

Problem description

I've got a dataset from a support ticketing system that logs each click an agent makes while classifying and responding to customer requests. The system assigns a new hist_id to each click, but an agent will click several fields, producing several rows in the table, in what they consider a single "interaction".

My goal is to calculate a handle time for each of these interactions by taking the difference between the first and last modify_time values in each group.

I'm currently stuck because an agent can have multiple interactions with the same case throughout the day.

Here's a sample data frame:

library(dplyr)  # provides %>%, group_by(), mutate(), first() and last() used below

hist_id <- c(1234, 2345, 3456, 4567, 5678, 6789, 7890)
case_id <- c(1, 1, 1, 1, 1, 1, 1)
agent_name <- c("John", "John", "John", "Paul", "Paul", "John", "John")
modify_time <- as.POSIXct(c(1510095120, 1510095180, 1510095240, 1510098600, 1510098720, 1510135200, 1510135320), origin = "1970-01-01")
df <- data.frame(hist_id, case_id, agent_name, modify_time)

Using group_by on case_id and agent_name groups all rows that match the criteria, as expected:

df %>%
  group_by(case_id, agent_name) %>%
  mutate(first = first(modify_time),
         last = last(modify_time),
         diff = min(difftime(last, first)))

Which gives me this:

# A tibble: 7 x 7
# Groups:   case_id, agent_name [2]
  hist_id case_id agent_name         modify_time               first                last       diff
    <dbl>   <dbl>     <fctr>              <dttm>              <dttm>              <dttm>     <time>
1    1234       1       John 2017-11-07 16:52:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
2    2345       1       John 2017-11-07 16:53:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
3    3456       1       John 2017-11-07 16:54:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
4    4567       1       Paul 2017-11-07 17:50:00 2017-11-07 17:50:00 2017-11-07 17:52:00   120 secs
5    5678       1       Paul 2017-11-07 17:52:00 2017-11-07 17:50:00 2017-11-07 17:52:00   120 secs
6    6789       1       John 2017-11-08 04:00:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs
7    7890       1       John 2017-11-08 04:02:00 2017-11-07 16:52:00 2017-11-08 04:02:00 40200 secs

This returns John's overall first and last modify_time values across the whole case. However, I need to group only consecutive rows that share the same case_id and agent_name, so that Paul's interaction in between is taken into account. Three interactions are recorded here: one by John, one by Paul, and a second one by John.

The desired output would be something like this:

# A tibble: 7 x 7
# Groups:   case_id, agent_name [2]
  hist_id case_id agent_name         modify_time               first                last       diff
    <dbl>   <dbl>     <fctr>              <dttm>              <dttm>              <dttm>     <time>
1    1234       1       John 2017-11-07 16:52:00 2017-11-07 16:52:00 2017-11-07 16:54:00 120 secs
2    2345       1       John 2017-11-07 16:53:00 2017-11-07 16:52:00 2017-11-07 16:54:00 120 secs
3    3456       1       John 2017-11-07 16:54:00 2017-11-07 16:52:00 2017-11-07 16:54:00 120 secs
4    4567       1       Paul 2017-11-07 17:50:00 2017-11-07 17:50:00 2017-11-07 17:52:00 120 secs
5    5678       1       Paul 2017-11-07 17:52:00 2017-11-07 17:50:00 2017-11-07 17:52:00 120 secs
6    6789       1       John 2017-11-08 04:00:00 2017-11-08 04:00:00 2017-11-08 04:02:00 120 secs
7    7890       1       John 2017-11-08 04:02:00 2017-11-08 04:00:00 2017-11-08 04:02:00 120 secs

Recommended answer

Here is a tidyverse approach that partitions the groups by a processing-cluster id in addition to case_id and agent_name:

Arrange all the clicks in sequence, then generate a flag each time the hist_id sequence transitions to a new agent_name. Take the cumsum of those flags within each case_id and agent_name to get a prcl_id; combined with case_id and agent_name, it uniquely identifies each processing cluster. With all three ids you can then run your chosen mutations within the desired partitions.

library(dplyr)  # assumed to be loaded; provides the verbs and %>% used below

df %>% 
    arrange(hist_id) %>%  # put the clicks in sequence
    mutate(ag_chg_flg = ifelse(lag(agent_name) != agent_name, 1, 0) %>%  # 1 where the agent differs from the previous row
               coalesce(0)  # the first row has no previous row, so its NA becomes 0
           ) %>% 
    group_by(case_id, agent_name) %>%  
    mutate(prcl_id = cumsum(ag_chg_flg) + 1) %>%  # running count of agent changes = processing cluster id
    group_by(case_id, agent_name, prcl_id) %>%  # group by the complete composite id
    mutate(first = first(modify_time),
           last = last(modify_time),
           diff = min(difftime(last, first))  # handle time of the interaction
           )

Which gets you:

# A tibble: 7 x 9
# Groups:   case_id, agent_name, prcl_id [3]
  hist_id case_id agent_name         modify_time ag_chg_flg prcl_id               first                last   diff
    <dbl>   <dbl>     <fctr>              <dttm>      <dbl>   <dbl>              <dttm>              <dttm> <time>
1    1234       1       John 2017-11-07 14:52:00          0       1 2017-11-07 14:52:00 2017-11-07 14:54:00 2 mins
2    2345       1       John 2017-11-07 14:53:00          0       1 2017-11-07 14:52:00 2017-11-07 14:54:00 2 mins
3    3456       1       John 2017-11-07 14:54:00          0       1 2017-11-07 14:52:00 2017-11-07 14:54:00 2 mins
4    4567       1       Paul 2017-11-07 15:50:00          1       2 2017-11-07 15:50:00 2017-11-07 15:52:00 2 mins
5    5678       1       Paul 2017-11-07 15:52:00          0       2 2017-11-07 15:50:00 2017-11-07 15:52:00 2 mins
6    6789       1       John 2017-11-08 02:00:00          1       2 2017-11-08 02:00:00 2017-11-08 02:02:00 2 mins
7    7890       1       John 2017-11-08 02:02:00          0       2 2017-11-08 02:00:00 2017-11-08 02:02:00 2 mins
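
The sample data contains a single case, so the change flag above is computed across the whole table. If clicks on several cases interleave in hist_id order, it may be safer to compare each click only with the previous click on the same case. The following is a minimal sketch of that variant, not part of the original answer: the extra row (hist_id 3000 on case_id 2), the names clicks, run_id, first_click, last_click and handle_secs, and the tz = "UTC" setting are assumptions made for illustration. It also collapses the result to one row per interaction with summarise().

```r
library(dplyr)

# Question's sample rows plus one hypothetical click (hist_id 3000) by Paul on a
# second case, landing in the middle of John's first interaction on case 1.
clicks <- data.frame(
  hist_id     = c(1234, 2345, 3000, 3456, 4567, 5678, 6789, 7890),
  case_id     = c(1, 1, 2, 1, 1, 1, 1, 1),
  agent_name  = c("John", "John", "Paul", "John", "Paul", "Paul", "John", "John"),
  modify_time = as.POSIXct(
    c(1510095120, 1510095180, 1510095200, 1510095240,
      1510098600, 1510098720, 1510135200, 1510135320),
    origin = "1970-01-01", tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

interactions <- clicks %>%
  arrange(case_id, hist_id) %>%
  group_by(case_id) %>%
  # TRUE whenever the agent differs from the previous click on the same case;
  # the first click of each case compares against NA and is coalesced to FALSE.
  mutate(ag_chg_flg = coalesce(lag(agent_name) != agent_name, FALSE),
         run_id = cumsum(ag_chg_flg) + 1) %>%
  group_by(case_id, agent_name, run_id) %>%
  # One row per interaction: first click, last click, and the time between them.
  summarise(first_click = first(modify_time),
            last_click = last(modify_time),
            handle_secs = as.numeric(difftime(last_click, first_click, units = "secs")),
            .groups = "drop") %>%
  arrange(first_click)

print(interactions)
```

On this data the result has four rows: John's first interaction on case 1 stays intact at 120 seconds despite the interleaved click, Paul's single click on case 2 has a handle time of 0, and the remaining two interactions on case 1 are 120 seconds each. In dplyr 1.1.0 and later, consecutive_id(agent_name) inside the per-case mutate() should give the same run_id without the explicit flag and cumsum.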
