Define sequences based on a variable run with additional condition from another variable

Views: 72 Published: 2020/10/26 5:02:16 Tags: r dplyr tidyverse

Problem description

structure(list(group = c(NA, "A", "B", NA, "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", NA, NA, "B", "B", "A", "A", NA, NA, "B", "B", "B", NA, "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", NA, NA, "B", "B", 
NA, "A"), seq_break = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, 
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)), .Names = c("group", 
"seq_break"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-50L))

In the data above, I need to define a column containing a run-length type ID for the group column (like the one data.table::rleid produces, but ignoring NA). As you can see, there is also a seq_break column, which should end a sequence. It usually does so where group is NA, since then seq_break = TRUE. But sometimes seq_break = TRUE while group is A or B; in that case the sequence should end and a new one should start, even if the next row refers to the same group. For example, rows 25:26 should get two different sequence IDs even though both rows refer to group B. The expected output is shown below:

structure(list(group = c(NA, "A", "B", NA, "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", NA, NA, "B", "B", "A", "A", NA, NA, "B", "B", "B", NA, "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", NA, NA, "B", "B", 
NA, "A"), seq_break = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, 
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE), expected_output = c(NA, 
1, 2, NA, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, NA, NA, 4, 5, 6, 6, NA, NA, 7, 7, 7, NA, 8, 8, 8, 8, 8, 8, 
8, 8, 8, 8, NA, NA, 11, 11, NA, 12)), .Names = c("group", "seq_break", 
"expected_output"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-50L))
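To illustrate the point about NA made above, here is a minimal sketch (the vector g is made up for this example and is not the question's data) showing that data.table::rleid gives a run of NA its own id rather than skipping it, which is why it cannot be applied to group directly:

```r
library(data.table)

# Hypothetical toy vector, not the question's data
g <- c(NA, "A", "A", NA, "B", "B")

# rleid() numbers every run of equal values, and a run of NA counts as a run
ids <- rleid(g)
ids
# [1] 1 2 2 3 4 4
```

The NA rows consume ids 1 and 3 here, whereas the desired result would leave them as NA and number only the A and B runs.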

How can I achieve this with the tidyverse?

Recommended answer

A solution using tidyverse and data.table. Here dt1 is assumed to be your example data frame and dt3 is the final output. Note that I think rows 47 to 48 of the expected output should be 9 and row 50 should be 10; I am not sure why your expected output has 11 for rows 47 to 48 and 12 for row 50.

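library(tidyverse)
library(data.table)

# Add a row id so the result can be joined back to the full data
dt2 <- dt1 %>% rowid_to_column()

dt3 <- dt2 %>%
  # Number consecutive runs of identical (group, seq_break) pairs
  mutate(ID = rleid(group, seq_break)) %>%
  group_by(group, seq_break, ID) %>%
  # Within each run of NA break rows, keep only the first row
  filter(!(is.na(group) & seq_break & row_number() > 1)) %>%
  ungroup() %>%
  # Count the breaks seen so far
  mutate(ID2 = cumsum(seq_break)) %>%
  # Drop the NA rows, which get no sequence id
  drop_na(group) %>%
  # A new sequence starts when the group or the break counter changes
  mutate(expected_output = rleid(group, ID2)) %>%
  select(rowid, expected_output) %>%
  # Join back so the NA rows reappear with expected_output = NA
  left_join(dt2, ., by = "rowid") %>%
  select(-rowid)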

dt3
# # A tibble: 50 x 3
#    group seq_break expected_output
#    <chr> <lgl>               <int>
#  1 NA    TRUE                   NA
#  2 A     FALSE                   1
#  3 B     FALSE                   2
#  4 NA    TRUE                   NA
#  5 B     FALSE                   3
#  6 B     FALSE                   3
#  7 B     FALSE                   3
#  8 B     FALSE                   3
#  9 B     FALSE                   3
# 10 B     FALSE                   3
# # ... with 40 more rows
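
Since the question asks for a tidyverse approach, the same numbering can probably also be sketched with dplyr alone. This is only an illustration under stated assumptions, not the answer's method: the function name add_seq_id, the column name seq_id, and the toy data are hypothetical; seq_break is assumed to contain no NA; and any row with group = NA is treated as ending the current sequence (in the sample data every such row has seq_break = TRUE anyway).

```r
library(dplyr)

# Hypothetical helper: a non-NA row starts a new sequence when it is flagged
# by seq_break, follows an NA row (or is the first row), or changes group.
add_seq_id <- function(df) {
  df %>%
    mutate(
      prev_group = lag(group),
      starts_new = !is.na(group) &
        (seq_break | is.na(prev_group) | group != prev_group),
      # Running count of sequence starts; rows with NA group get no id
      seq_id = if_else(is.na(group), NA_integer_, cumsum(starts_new))
    ) %>%
    select(-prev_group, -starts_new)
}

# Made-up toy data, not the question's data
toy <- tibble(
  group     = c(NA, "A", "B", "B", "B", NA, NA, "B", "A"),
  seq_break = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

result <- add_seq_id(toy)
result$seq_id
# [1] NA  1  2  2  3 NA NA  4  5
```

Under these assumptions, add_seq_id(dt1) should number the sample data 1 to 10, in line with the correction noted above for rows 47 to 48 and row 50.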
