Detecting a sequence by group and computing a new variable for the subset


Problem description

I need to detect a sequence by group in a data.frame and compute a new variable.

Consider the following data.frame:

df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
              seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
              count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
              product = c("A", "B", "C", "C", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
              stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))

df1

> df1
   ID seqs count product   stock
1   1    1     2       A       A
2   1    2     1       B     A,B
3   1    3     3       C   A,B,C
4   1    4     1       C   A,B,C
5   1    5     1     A,B   A,B,C
6   1    6     2   A,B,C   A,B,C
7   1    7     3       D A,B,C,D
8   2    1     1       A       A
9   2    2     2       B     A,B
10  2    3     1       A     A,B
11  3    1     3       A       A
12  3    2     1   A,B,C   A,B,C
13  3    3     4       D A,B,C,D
14  3    4     1       D A,B,C,D

I am interested in computing a measure for the IDs whose rows follow this sequence:

  - Count == 1
  - Count > 1
  - Count == 1

In the example this is true for:

 - rows 2, 3, 4 for `ID==1`
 - rows 8, 9, 10 for `ID==2`
 - rows 12, 13, 14 for `ID==3`
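
One way to check which rows close a 1, >1, 1 run is to compare each count with the two rows before it inside the same ID. The base R sketch below copies only the ID and count columns from df1; lag_n() and is_end are hypothetical names introduced for illustration and are not part of the original question.

```r
# Only the columns needed for the pattern check, copied from df1
df <- data.frame(
  ID    = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
  count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1)
)

# Hypothetical helper: shift x down by n positions, padding the top with NA
lag_n <- function(x, n) c(rep(NA, n), x)[seq_along(x)]

# Lagged counts computed separately within each ID
prev1 <- ave(df$count, df$ID, FUN = function(x) lag_n(x, 1))
prev2 <- ave(df$count, df$ID, FUN = function(x) lag_n(x, 2))

# TRUE on the last row of a (count == 1, count > 1, count == 1) run
is_end <- !is.na(prev2) & df$count == 1 & prev1 > 1 & prev2 == 1

which(is_end)
```

On the example data this flags rows 4, 10 and 14, the closing rows of the three runs listed above.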

For these IDs and rows, I need to compute a measure called new. It takes the product value of the last row of the sequence if that product also appears in the second row of the sequence and does NOT appear in the stock of the first row of the sequence. Otherwise new is left empty.
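
The rule for new can be read as a set operation on the comma-separated items. The sketch below is one possible reading, assuming that for multi-item products each item is tested separately; split_items() and compute_new() are hypothetical helpers, not taken from the original post.

```r
# Hypothetical helper: split "A,B,C" into c("A", "B", "C")
split_items <- function(x) trimws(strsplit(x, ",", fixed = TRUE)[[1]])

# Keep the items of the last row's product that appear in the second row's
# product and do not appear in the first row's stock; "" if none qualify
compute_new <- function(last_product, second_product, first_stock) {
  cand <- split_items(last_product)
  keep <- cand[cand %in% split_items(second_product) &
               !(cand %in% split_items(first_stock))]
  paste(keep, collapse = ",")
}

compute_new("C", "C", "A,B")    # rows 2, 3, 4 of ID 1
compute_new("A", "B", "A")      # rows 8, 9, 10 of ID 2
compute_new("D", "D", "A,B,C")  # rows 12, 13, 14 of ID 3
```

Under this reading the three calls give "C", "" and "D" respectively.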

The desired output looks like this:

> output
  ID seq1 seq2 seq3 new
1  1    2    3    4   C
2  2    1    2    3    
3  3    2    3    4   D

Note:

  1. In the sequence detected for an ID, no new products are added to the stock.
  2. In the original data there are many IDs that do not have any sequence.
  3. Some IDs have multiple qualifying sequences. All of them should be recorded.
  4. Count is always 1 or greater.
  5. The original data holds millions of IDs with up to 1500 sequences.
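
Note 3 implies that a single ID may contribute several rows to the output. Below is a minimal sketch on a made-up count vector (not from the original data), assuming that sequences may overlap, so the closing row of one sequence can also open the next. It uses base R's embed() to build sliding windows of three rows.

```r
# Made-up counts for a single ID
count <- c(1, 2, 1, 3, 1, 1)

# Each row of w is one window: (current, previous, two rows back)
w <- embed(count, 3)

# Windows that match count == 1, count > 1, count == 1 (oldest to newest)
hit <- w[, 1] == 1 & w[, 2] > 1 & w[, 3] == 1

# Window i ends at position i + 2 of the original vector
ends <- which(hit) + 2L
out <- data.frame(seq1 = ends - 2L, seq2 = ends - 1L, seq3 = ends)
out
```

Here two sequences are recorded, positions 1 to 3 and positions 3 to 5, with position 3 shared between them.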

How would you write an efficient piece of code to get this output?

Recommended answer

Here is a data.table option:

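library(data.table)

# Make sure product and stock are character, in case they are factors
char_cols <- c("product", "stock")
setDT(df1)[, 
           (char_cols) := lapply(.SD, as.character), 
           .SDcols = char_cols]

# Condition 1: within each ID, the current row has count == 1, the previous
# row has count > 1, and the row before that has count == 1
df1[, c1 := (count == 1) & 
            (shift(count) > 1) & 
            (shift(count, 2L) == 1), 
     by = ID]

# Turn each product string into a regex alternation, e.g. "A,B" becomes "(A|B)"
df1[, pat := paste0("(", gsub(",", "|", product), ")")]

# Condition 2: the current product matches the previous row's product
# but does not match the stock from two rows back
df1[, c2 := mapply(grepl, pat, shift(product)) & 
            !mapply(grepl, pat, shift(stock, 2L)), 
    by = ID]

# On rows that close a sequence, new is the product if condition 2 holds, else ""
df1[(c1), new := ifelse(c2, product, "")]

# seq1 to seq3 hold the seqs values of the three rows in each sequence
df1[, paste0("seq", 1:3) := shift(seqs, 2:0)]

# Result: one row per detected sequence
df1[(c1), .(ID, seq1, seq2, seq3, new)]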
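
To see what the pat column and mapply(grepl, ...) do in the code above, here is a small base R sketch on made-up strings. It also shows a caveat: a plain alternation matches substrings, so "A" would match a product named "AB". The word-boundary variant is an assumption about how one might tighten the match and is not part of the original answer.

```r
# Made-up inputs: patterns are built from product, then tested against target
product <- c("A,B", "C", "A")
target  <- c("B", "A,B", "AB")

# Same construction as in the answer: "A,B" becomes "(A|B)"
pat <- paste0("(", gsub(",", "|", product), ")")

# Element-wise match: grepl(pat[i], target[i])
loose <- mapply(grepl, pat, target, USE.NAMES = FALSE)

# Assumed tightening: require word boundaries around the matched item
pat_strict <- paste0("\\b", pat, "\\b")
strict <- mapply(grepl, pat_strict, target,
                 MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)

loose
strict
```

With these inputs loose is TRUE, FALSE, TRUE while strict is TRUE, FALSE, FALSE: only the third pair differs, because "A" is a substring of "AB" but not a whole item. For single-letter product codes like those in the example data, the two variants behave the same.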
