根据其他列 UserID 识别数据序列更改的情况 [英] Identify cases where data sequence changes based on other column UserIDs

查看：52 发布时间：2021/7/19 18:46:42 r group-by sequence multiple-columns rolling-computation

本文介绍了根据其他列 UserID 识别数据序列更改的情况的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在处理一个数据框 df，如下所示:

I am working on a data frame df which is as below:

TUserId  SUID   mid_sum final_sum
 115      201   2       7     
 115      309   1       8     
 115      404   1       9           
 209      245   2       10    
 209      398   2       10          
 209      510   2       10
 209      602   1       10
 371      111   2       11
 371      115   1       11
 371      123   3       11
 371      124   2       11

1- 我的数据以 wide 格式排列，其中每一行都有一个唯一的学生 ID，显示为 SUID.

1- My data is arranged in a wide format, where each row has a unique student ID shown as SUID.

2- 多个学生可以有相同的老师，因此多行的公共老师 ID 显示为 TUserId.

2- Several students can have the same teacher and hence the common teacher ID across multiple rows shown as TUserId.

3- 数据包括学生期中成绩和学生期末成绩.

3- The data includes student scores in mid-terms and then students' final scores.

4- 我很想知道是否有老师在期中给学生类似的分数(如 mid_sum 所示)在他们的期末考试中给出不一致的分数，如图所示通过 final_sum.如果在数据中发现这种不一致，我想添加一列Status，记录这个不一致.

4- I am interested in finding out if there are any instances where a teacher who gave similar scores to their students on mid-terms as shown by mid_sum gave inconsistent scores on their final exams as shown by final_sum. If such inconsistency is found in data, I want to add a column Status that records this inconsistency.

a- 为此，我的规则是，如果 mid_sum 和 final_sum 按升序排序，就像我在这个示例数据框 df.我想确定升序在这些列 mid_sum 和 final_sum 中的任何一个中中断的情况.


a- For this, my rule is that if mid_sum and final_sum are sorted in ascending order, as I have done in this example data frame df. I want to identify the cases where the ascending sequence breaks in either of these columns mid_sum and final_sum.
b- 如果数据没有排序，可以做吗?
b- Can it be done, if the data is not sorted?
例如，对于 SUID = 309，mid_sum 是前一个 mid_sum 的递减.所以应该标记为inconsistent.它应该只发生在被同一位老师 TUserId 标记的学生身上，在这种情况下是 115.
For example, for SUID = 309, mid_sum is a decrement from the previous mid_sum. So it should be marked as inconsistent. It should only happen for students who were marked by the same teacher TUserId, which in this case is 115.
同样，对于SUID = 602，mid_sum 是从之前的mid_sum 递减.所以应该标记为inconsistent.同样，它是针对同一个老师 TUserId = 209
Similarly,  for SUID = 602, mid_sum is a decrement from the previous mid_sum. So it should be marked as inconsistent. Again, it is for the same teacher TUserId = 209
为了进一步详细说明，我想要这样的输出:
To elaborate further, I want an output like this:
TUserId  SUID   mid_sum final_sum   Status
 115      201   2       7           consistent
 115      309   1       8           inconsistent
 115      404   1       9           consistent
 209      245   2       10          consistent
 209      398   2       10          consistent
 209      510   2       10          consistent
 209      602   1       10          inconsistent
 371      111   2       11          consistent
 371      115   1       11          inconsistent
 371      123   3       11          consistent
 371      124   2       11          inconsistent
 

数据导入dput()
数据框的dput()如下:
dput(df)

structure(list(
TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L), 
SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L), 
mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L), 
final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)), 
class = "data.frame", row.names = c(NA, -11L))

我在 SO 上寻找了类似的问题，发现了这个 R - 识别连续序列 但它似乎没有帮助我解决我的问题.另一个相关的帖子是确定一个数字序列的时间已经在 R 中被破坏了，但同样，它对我的情况没有帮助.
I looked for similar questions on SO and found this R - identify consecutive sequences but it does not seem to help me address my question.
Another related post was Determine when a sequence of numbers has been broken in R but again, it does not help in my case.
关于如何解决这个问题的任何建议将不胜感激.
Any advice on how to solve this problem would be greatly appreciated.
谢谢！
推荐答案
也许 accumulate 函数系列就是为这些情况设计的.在这里使用 accumulate2 -
Perhaps accumulate family of functions has been designed for these situations.  Using accumulate2 here -
作为第一个参数，我通过 mid_sum
第二个参数是滞后值，即 lag(mid_sum) ，默认为除 NA 之外的任何值和它可能采用的实际值.我认为 0 是安全的
.init 提供任何值.我只选择了 c.
如果第一个参数 (..2) [..1 是累积值而不是第一个参数] 小于 ..3 即第二个参数，返回 不一致 else 一致.
现在由于提供了 .init，结果将比提供的值大一个值，因此删除了它的第一个值 [-1]


As first argument I am passing through mid_sum
second argument is lagged value i.e. lag(mid_sum) with default as any value except NA and actual values it may take.  I am taking 0 as safe
.init is provided with any value.  I chose c only.
if first argument (..2) [..1 is accumulated value and not first arg] is less than ..3 i.e. second argument, return inconsistent else consistent.
Now since .init is provided the results will be one value large than provided, so stripped its first value [-1]

df <- structure(list(
  TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L), 
  SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L), 
  mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L), 
  final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)), 
  class = "data.frame", row.names = c(NA, -11L))

library(tidyverse)

df %>%
  arrange(TUserId, final_sum) %>%
  group_by(TUserId) %>%
  mutate(status = unlist(accumulate2(mid_sum, lag(mid_sum, default = 0), .init = 'c', 
                              ~ if(..2 < ..3) 'inconsistent' else 'consistent')[-1]))
#> # A tibble: 11 x 5
#> # Groups:   TUserId [3]
#>    TUserId  SUID mid_sum final_sum status      
#>      <int> <int>   <int>     <int> <chr>       
#>  1     115   201       2         7 consistent  
#>  2     115   309       1         8 inconsistent
#>  3     115   404       1         9 consistent  
#>  4     209   245       2        10 consistent  
#>  5     209   398       2        10 consistent  
#>  6     209   510       2        10 consistent  
#>  7     209   602       1        10 inconsistent
#>  8     371   111       2        11 consistent  
#>  9     371   115       1        11 inconsistent
#> 10     371   123       3        11 consistent  
#> 11     371   124       2        11 inconsistent

^{由 reprex 包 (v2.0.0) 于 2021 年 6 月 15 日创建}
^{Created on 2021-06-15 by the reprex package (v2.0.0)}

                        这篇关于根据其他列 UserID 识别数据序列更改的情况的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

根据其他列 UserID 识别数据序列更改的情况 [英] Identify cases where data sequence changes based on other column UserIDs

问题描述

数据导入dput()

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

根据其他列 UserID 识别数据序列更改的情况 [英] Identify cases where data sequence changes based on other column UserIDs

问题描述

数据导入dput()

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭