Counting values after and before a change in value, within groups, generating new variables for each unique shift


Problem description

I am looking for a way to count, within id groups, unique occurrences of value shifts in TF in the data tbl.

I want to count both forwards and backwards from the point where TF changes between 1 and 0 or between 0 and 1. The counts are to be stored in new variables PM##, so that each PM## holds one unique shift in TF, counted in both plus and minus. The MWE below leads to an outcome with 7 PM columns, but my production data can have 15 or more shifts. If a TF value does not change between NAs, I want to mark it 0.
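The counting rule described above can be sketched in base R for the simplest case: a single NA-free stretch of TF with at most one shift. The helper name `count_around_shift` is hypothetical and is not part of the question or the answers. It only illustrates the signed counting and the "mark it 0" rule, not the full PM## layout.

```r
# Hypothetical helper: signed positions around the first change in x.
# Assumes x contains no NA values.
count_around_shift <- function(x) {
  # index of the first value after the change (NA if x never changes)
  shift_at <- which(diff(x) != 0)[1] + 1L
  if (is.na(shift_at)) {
    return(rep(0L, length(x)))      # no change: mark every value 0
  }
  pos <- seq_along(x) - shift_at    # ..., -2, -1, 0, 1, ...
  ifelse(pos >= 0L, pos + 1L, pos)  # skip zero so the count runs -1, 1
}

count_around_shift(c(0, 0, 0, 1, 1, 1))
#> [1] -3 -2 -1  1  2  3
count_around_shift(c(1, 1, 1, 1, 1))
#> [1] 0 0 0 0 0
```

The first call reproduces the PM01 values the question expects for the 0, 0, 0, 1, 1, 1 stretch in id 1.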

This question is similar to a question I previously asked, but the last part about TF standing alone is new. Both Uwe and Psidom provided elegant answers to the initial question, using data.table and tidyverse respectively. After conferring with Uwe, I am posting this slightly modified version of my question.


If this question violates any SO policies, please let me know and I'll be happy to reopen my initial question or append this as a bounty issue.

To illustrate my question, here is a minimal working example. I have data like this,

# install.packages(c("tidyverse"), dependencies = TRUE)
library(tibble)
library(dplyr) # provides %>% and bind_rows(), both used below
tbl <- tibble(id = c(rep(10L, 17L), rep(0L, 13L), rep(1L, 10L)), 
       TF = c(NA, NA, 0, NA, 0, NA, 1, 1, 1, 1, 1, NA, 1, 0, 1, 0, 1, NA, 0L, NA, 0L, 
       0L, 1L, 1L, 1L, 0L, 0L, NA, NA, 0L, NA, 0L, 0L, 0L, 1L, 1L, 1L, 0L, NA, 1L))
tbl %>% print(n=18)
#> # A tibble: 40 x 2
#>       id    TF
#>    <int> <dbl>
#>  1    10    NA
#>  2    10    NA
#>  3    10     0
#>  4    10    NA
#>  5    10     0
#>  6    10    NA
#>  7    10     1
#>  8    10     1
#>  9    10     1
#> 10    10     1
#> 11    10     1
#> 12    10    NA
#> 13    10     1
#> 14    10     0
#> 15    10     1
#> 16    10     0
#> 17    10     1
#> 18     0    NA
#> # ... with 22 more rows



What I am trying to obtain:

tblPM <- structure(list(id = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), TF = c(NA, NA, 0, NA, 0, NA, 1, 1, 1, 1, 1, 
NA, 1, 0, 1, 0, 1, NA, 0, NA, 0, 0, 1, 1, 1, 0, 0, 
NA, NA, 0, NA, 0, 0, 0, 1, 1, 1, 0, NA, 1), PM01 = c(NA, 
NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L, 
-2L, -1L, 1L, 2L, 3L, NA, NA, NA), PM02 = c(NA, NA, NA, NA, 0L, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -2L, 
-1L, 1L, 2L, 3L, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L, -2L, 
-1L, 1L, NA, NA), PM03 = c(NA, NA, NA, NA, NA, NA, 0L, 0L, 0L, 
0L, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L, -2L, 
-1L, 1L, 2L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
0L), PM04 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
-1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), PM05 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -1L, 1L, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA), PM06 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, -1L, 1L, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA), PM07 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, -1L, 1L, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
)), .Names = c("id", "TF", "PM01", "PM02", "PM03", "PM04", "PM05", 
"PM06", "PM07"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -40L
))


tblPM %>% print(n=18)  
#> # A tibble: 40 x 9
#>       id    TF  PM01  PM02  PM03  PM04  PM05  PM06  PM07
#>    <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
#>  1    10    NA    NA    NA    NA    NA    NA    NA    NA
#>  2    10    NA    NA    NA    NA    NA    NA    NA    NA
#>  3    10     0     0    NA    NA    NA    NA    NA    NA
#>  4    10    NA    NA    NA    NA    NA    NA    NA    NA
#>  5    10     0    NA     0    NA    NA    NA    NA    NA
#>  6    10    NA    NA    NA    NA    NA    NA    NA    NA
#>  7    10     1    NA    NA     0    NA    NA    NA    NA
#>  8    10     1    NA    NA     0    NA    NA    NA    NA
#>  9    10     1    NA    NA     0    NA    NA    NA    NA
#> 10    10     1    NA    NA     0    NA    NA    NA    NA
#> 11    10     1    NA    NA     0    NA    NA    NA    NA
#> 12    10    NA    NA    NA    NA    NA    NA    NA    NA
#> 13    10     1    NA    NA    NA    -1    NA    NA    NA
#> 14    10     0    NA    NA    NA     1    -1    NA    NA
#> 15    10     1    NA    NA    NA    NA     1    -1    NA
#> 16    10     0    NA    NA    NA    NA    NA     1    -1
#> 17    10     1    NA    NA    NA    NA    NA    NA     1
#> 18     0    NA    NA    NA    NA    NA    NA    NA    NA
#> # ... with 22 more rows 

identical([some solution], tblPM)
#> [1] TRUE

Update with microbenchmark, 2018-01-24 14:20:18Z

Thanks to Fierr and Chris for taking the time to tease out the logic and submit an answer. Inspired by this setup, I have computed a small microbenchmark comparison of their functions. I put Fierr's answer into the function tidyverse_Fierr() and Chris' answer into dt_Chris() (if someone wants the exact functions, please let me know and I'll add them here).

After some minor tweaks, both come out identical when matched against tblPM, i.e.

identical(tblPM, tidyverse_Fierr(tbl))
#> [1] TRUE
identical(tblPM, dt_Chris(tbl))
#> [1] TRUE

Now to the quick microbenchmark,

df_test <- bind_rows(rep(list(tbl), 111))
microbenchmark::microbenchmark(tidyverse_Fierr(df_test), dt_Chris(df_test), times = 3*1)
#> Unit: milliseconds
#>                      expr      min       mean   median        uq         max neval cld
#> tidyverse_Fierr(df_test) 19503.366  20171.268 20080.99 20505.219  20929.4489     3   b
#>        dt_Chris(df_test)   199.165    233.924   203.72   251.304    298.8887     3   a 

Interestingly, the tidy_method comes out much faster in this somewhat similar comparison.

Recommended answer

Here is a script approach. Given the amount of custom treatment for each case (TF = NA, uniqueN(TF) = 1, uniqueN(TF) = 2), I think this is likely clearer to implement than a dplyr chain. It should be fairly quick, as it is all data.table based. Open to suggestions on how to improve!
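The three cases named above can be seen on the TF values of id 10 from the question. The following base R sketch mirrors the NA-splitting idea of the script (a new segment starts whenever the current or previous value is NA), but it is only an illustration. The variable names `prev`, `segment` and `n_unique` are made up here and do not appear in the answer.

```r
# TF values for id 10, copied from the question's example data
tf <- c(NA, NA, 0, NA, 0, NA, 1, 1, 1, 1, 1, NA, 1, 0, 1, 0, 1)

# Lagged copy of tf; the first element has no predecessor
prev <- c(NA, head(tf, -1))

# A new segment starts whenever the current or the previous value is NA
segment <- cumsum(is.na(tf) | is.na(prev))

# Number of distinct non-NA values per segment:
# 0 = NA row, 1 = constant run (PM marked 0), 2 = at least one shift
n_unique <- tapply(tf, segment, function(v) length(unique(v[!is.na(v)])))

segment
#>  [1] 1 2 3 4 5 6 7 7 7 7 7 8 9 9 9 9 9
as.integer(n_unique)
#> [1] 0 0 1 0 1 0 1 0 2
```

Only the last segment (rows 13 to 17) contains shifts, which is where the question expects the -1 / 1 pairs in PM04 to PM07.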

This will expand automatically as the number of PM columns required increases. As I commented below, I would recommend getting rid of the 0 prefix in the column name, as there may be a case where you reach 100 or more columns, which would bump the naming to PM001.
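If the zero-padded names are kept, one way to avoid a hard-coded "0" is to derive the padding width from the number of shifts. This is a minimal sketch, not part of the answer: the helper `pm_names` is hypothetical, and the two-digit minimum is an assumption chosen to match the PM01 to PM07 names in the question.

```r
# Hypothetical helper: PM column names zero-padded to a common width.
# Assumes n_shifts is a single positive whole number.
pm_names <- function(n_shifts) {
  # at least two digits, more if the number of shifts needs them
  width <- max(2L, nchar(as.character(n_shifts)))
  paste0("PM", formatC(seq_len(n_shifts), width = width, flag = "0"))
}

pm_names(7)
#> [1] "PM01" "PM02" "PM03" "PM04" "PM05" "PM06" "PM07"
pm_names(100)[c(1, 100)]
#> [1] "PM001" "PM100"
```

Padding to a common width keeps the columns in numeric order when they are sorted alphabetically, which matters because dcast() orders the new columns by name.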

library(data.table)
tbl3 <- data.table(id = c(rep(10L, 17L), rep(0L, 13L), rep(1L, 10L)), 
                   TF = c(NA, NA, 0L, NA, 0L, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L, 0L, 1L, 0L, 1L, NA, 0L, NA, 0L, 
                          0L, 1L, 1L, 1L, 0L, 0L, NA, NA, 0L, NA, 0L, 0L, 0L, 1L, 1L, 1L, 0L, NA, 1L))


# create index to ultimately join back to
tbl3[, row_idx := .I]

# all transformations on a replicated data.table
tbl3_tmp <- copy(tbl3)

# identify where the NA breaks occur - this splits each id into subgroups (id_group)
tbl3_tmp[, P_TF := shift(TF, 1, "lag", fill = NA), by = .(id)]
tbl3_tmp[, TF_break := is.na(TF) | is.na(P_TF)]
tbl3_tmp[, id_group := cumsum(TF_break), by = .(id)]

tbl3_tmp[, `:=`(TF_break = NULL, P_TF = NULL)] # the steps above can be consolidated into one line, which would make this line unnecessary - expanded for easier understanding
tbl3_tmp <- tbl3_tmp[!is.na(TF)] # NA rows can be safely ignored now - these will be all NA, and will be handled with the left join below

# find where subpatterns exist (runs of 0..1 or 1..0)
tbl3_tmp[, subpattern_break := TF != shift(TF, 1, "lag", fill = NA), by = .(id, id_group)]
tbl3_tmp[, subbreaks := sum(subpattern_break, na.rm = TRUE), by = .(id, id_group)] # if there are no breaks, we need to treat separately

# two cases: zero subbreaks and multiple subbreaks. 
tbl3_zeros <- tbl3_tmp[subbreaks == 0]
tbl3_nonzeros <- tbl3_tmp[subbreaks > 0]

# for 1+ subbreaks, we need to double the rows - this allows us to easily create the PM_field both "forwards" and "backwards"
tbl3_nonzeros[is.na(subpattern_break), subpattern_break := TRUE]
tbl3_nonzeros[, subbreak_index := cumsum(subpattern_break), by = .(id, id_group)]

tbl3_nonzeros <- rbindlist(list(tbl3_nonzeros,tbl3_nonzeros), idcol = "base") # double the row

tbl3_nonzeros[base == 1 & subbreak_index %% 2 == 1, subbreak_index := subbreak_index + 1L] # round to nearest even
tbl3_nonzeros[base == 2 & subbreak_index %% 2 == 0, subbreak_index := subbreak_index + 1L] # round to nearest odd

# this creates an index when the subbreak starts - allows us to sequence PM properly
tbl3_nonzeros[,subbreak_start := min(row_idx), by = .(id, id_group, subbreak_index)]

# exclude the ends if there is only one unique TF value - might be able to get this to one line
tbl3_nonzeros[, TF_count := uniqueN(TF), by = .(id, id_group, subbreak_index)]
tbl3_nonzeros <- tbl3_nonzeros[TF_count > 1]

# create a 1..N column, subtract the index where the break occurs ,then add 1 to all 0+ values.
tbl3_nonzeros[,PM_field := 1:.N, by = .(id, id_group, subbreak_index)]
tbl3_nonzeros[, PM_field := PM_field - PM_field[which(diff(TF)!=0)[1]+1], by = .(id, id_group, subbreak_index)]
tbl3_nonzeros[PM_field >= 0, PM_field := PM_field + 1L] # base 1 after the break

# create subbreaks for zero groups
tbl3_zeros[,subbreak_start := min(row_idx), by = .(id, id_group)]

# bring zero and non zero case together
tbl3_zeros <- tbl3_zeros[, .(id, id_group, subbreak_start,row_idx = row_idx, PM_field = 0L)]
tbl3_nonzeros <- tbl3_nonzeros[,.(id, id_group, subbreak_start, row_idx, PM_field)]
tbl3_tmp <- rbindlist(list(tbl3_zeros, tbl3_nonzeros))

# Create header
tbl3_tmp <- tbl3_tmp[order(subbreak_start, PM_field)] 
# The "0" in "PM0" is kept only for the identical() check below; with 10 or more shifts, zero-pad instead, e.g.
# paste0("PM", sprintf("%02d", cumsum(c(1, diff(subbreak_start) != 0))))
tbl3_tmp[, PM_header := paste0("PM0", cumsum(c(1, diff(subbreak_start) != 0))), by = .(id)]

# long to wide
tbl3_tmp <- dcast(tbl3_tmp, row_idx ~ PM_header, value.var = "PM_field", fun.aggregate = sum, fill = NA)

# merge back to initial dataframe
tblPM_frombase <- merge(tbl3, tbl3_tmp, by = "row_idx", all.x = TRUE)[, row_idx := NULL]

identical(tblPM, tblPM_frombase)
#> [1] TRUE
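For readers who want to see the per-shift logic in isolation, here is an alternative base R sketch for one NA-free segment. It is not the data.table method used in the script above: instead of doubling rows, it uses rle() to find the runs and pairs each run with the next one. The function name `shift_columns` is hypothetical.

```r
# Hypothetical helper: one column per shift for a single NA-free segment x.
# Each column counts backwards (-n, ..., -1) over the run before the shift
# and forwards (1, ..., m) over the run after it; other rows stay NA.
shift_columns <- function(x) {
  runs <- rle(x)
  n_runs <- length(runs$lengths)
  if (n_runs < 2L) {
    # no shift in this segment: a single column of zeros
    return(matrix(0L, nrow = length(x), ncol = 1L))
  }
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1L
  out <- matrix(NA_integer_, nrow = length(x), ncol = n_runs - 1L)
  for (k in seq_len(n_runs - 1L)) {
    before <- starts[k]:ends[k]            # rows of the run before shift k
    after <- starts[k + 1L]:ends[k + 1L]   # rows of the run after shift k
    out[before, k] <- -rev(seq_along(before))
    out[after, k] <- seq_along(after)
  }
  out
}

shift_columns(c(0, 0, 1, 1, 1, 0, 0))
#>      [,1] [,2]
#> [1,]   -2   NA
#> [2,]   -1   NA
#> [3,]    1   -3
#> [4,]    2   -2
#> [5,]    3   -1
#> [6,]   NA    1
#> [7,]   NA    2
```

The two columns match what the question expects in PM02 and PM03 for the 0, 0, 1, 1, 1, 0, 0 stretch of id 0. Stitching such blocks together across segments and ids, and naming the columns, is what the script above handles.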
