根据其他列 UserID 识别数据序列更改的情况 [英] Identify cases where data sequence changes based on other column UserIDs

查看:52
本文介绍了根据其他列 UserID 识别数据序列更改的情况的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在处理一个数据框 df,如下所示:

I am working on a data frame df which is as below:

TUserId  SUID   mid_sum final_sum
 115      201   2       7     
 115      309   1       8     
 115      404   1       9           
 209      245   2       10    
 209      398   2       10          
 209      510   2       10
 209      602   1       10
 371      111   2       11
 371      115   1       11
 371      123   3       11
 371      124   2       11

1- 我的数据以 wide 格式排列,其中每一行都有一个唯一的学生 ID,显示为 SUID.

1- My data is arranged in a wide format, where each row has a unique student ID shown as SUID.

2- 多个学生可以有相同的老师,因此多行的公共老师 ID 显示为 TUserId.

2- Several students can have the same teacher and hence the common teacher ID across multiple rows shown as TUserId.

3- 数据包括学生期中成绩和学生期末成绩.

3- The data includes student scores in mid-terms and then students' final scores.

4- 我很想知道是否有老师在期中给学生类似的分数(如 mid_sum 所示)在他们的期末考试中给出不一致的分数,如图所示通过 final_sum.如果在数据中发现这种不一致,我想添加一列Status,记录这个不一致.

4- I am interested in finding out if there are any instances where a teacher who gave similar scores to their students on mid-terms as shown by mid_sum gave inconsistent scores on their final exams as shown by final_sum. If such inconsistency is found in data, I want to add a column Status that records this inconsistency.

a- 为此,我的规则是,如果 mid_sumfinal_sum 按升序排序,就像我在这个示例数据框 df.我想确定升序在这些列 mid_sumfinal_sum 中的任何一个中中断的情况.

a- For this, my rule is that if mid_sum and final_sum are sorted in ascending order, as I have done in this example data frame df. I want to identify the cases where the ascending sequence breaks in either of these columns mid_sum and final_sum.

b- 如果数据没有排序,可以做吗?

b- Can it be done, if the data is not sorted?

例如,对于 SUID = 309mid_sum 是前一个 mid_sum 的递减.所以应该标记为inconsistent.它应该只发生在被同一位老师 TUserId 标记的学生身上,在这种情况下是 115.

For example, for SUID = 309, mid_sum is a decrement from the previous mid_sum. So it should be marked as inconsistent. It should only happen for students who were marked by the same teacher TUserId, which in this case is 115.

同样,对于SUID = 602mid_sum 是从之前的mid_sum 递减.所以应该标记为inconsistent.同样,它是针对同一个老师 TUserId = 209

Similarly, for SUID = 602, mid_sum is a decrement from the previous mid_sum. So it should be marked as inconsistent. Again, it is for the same teacher TUserId = 209

为了进一步详细说明,我想要这样的输出:

To elaborate further, I want an output like this:

TUserId  SUID   mid_sum final_sum   Status
 115      201   2       7           consistent
 115      309   1       8           inconsistent
 115      404   1       9           consistent
 209      245   2       10          consistent
 209      398   2       10          consistent
 209      510   2       10          consistent
 209      602   1       10          inconsistent
 371      111   2       11          consistent
 371      115   1       11          inconsistent
 371      123   3       11          consistent
 371      124   2       11          inconsistent
 

数据导入dput()

数据框的dput()如下:

dput(df)

structure(list(
TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L), 
SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L), 
mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L), 
final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)), 
class = "data.frame", row.names = c(NA, -11L))

我在 SO 上寻找了类似的问题,发现了这个 R - 识别连续序列 但它似乎没有帮助我解决我的问题.另一个相关的帖子是确定一个数字序列的时间已经在 R 中被破坏了,但同样,它对我的​​情况没有帮助.

I looked for similar questions on SO and found this R - identify consecutive sequences but it does not seem to help me address my question. Another related post was Determine when a sequence of numbers has been broken in R but again, it does not help in my case.

关于如何解决这个问题的任何建议将不胜感激.

Any advice on how to solve this problem would be greatly appreciated.

谢谢!

推荐答案

也许 accumulate 函数系列就是为这些情况设计的.在这里使用 accumulate2 -

Perhaps accumulate family of functions has been designed for these situations. Using accumulate2 here -

  • 作为第一个参数,我通过 mid_sum
  • 第二个参数是滞后值,即 lag(mid_sum) ,默认为除 NA 之外的任何值和它可能采用的实际值.我认为 0 是安全的
  • .init 提供任何值.我只选择了 c.
  • 如果第一个参数 (..2) [..1 是累积值而不是第一个参数] 小于 ..3 即第二个参数,返回 不一致 else 一致.
  • 现在由于提供了 .init,结果将比提供的值大一个值,因此删除了它的第一个值 [-1]
  • As first argument I am passing through mid_sum
  • second argument is lagged value i.e. lag(mid_sum) with default as any value except NA and actual values it may take. I am taking 0 as safe
  • .init is provided with any value. I chose c only.
  • if first argument (..2) [..1 is accumulated value and not first arg] is less than ..3 i.e. second argument, return inconsistent else consistent.
  • Now since .init is provided the results will be one value large than provided, so stripped its first value [-1]
df <- structure(list(
  TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L), 
  SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L), 
  mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L), 
  final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)), 
  class = "data.frame", row.names = c(NA, -11L))

library(tidyverse)

df %>%
  arrange(TUserId, final_sum) %>%
  group_by(TUserId) %>%
  mutate(status = unlist(accumulate2(mid_sum, lag(mid_sum, default = 0), .init = 'c', 
                              ~ if(..2 < ..3) 'inconsistent' else 'consistent')[-1]))
#> # A tibble: 11 x 5
#> # Groups:   TUserId [3]
#>    TUserId  SUID mid_sum final_sum status      
#>      <int> <int>   <int>     <int> <chr>       
#>  1     115   201       2         7 consistent  
#>  2     115   309       1         8 inconsistent
#>  3     115   404       1         9 consistent  
#>  4     209   245       2        10 consistent  
#>  5     209   398       2        10 consistent  
#>  6     209   510       2        10 consistent  
#>  7     209   602       1        10 inconsistent
#>  8     371   111       2        11 consistent  
#>  9     371   115       1        11 inconsistent
#> 10     371   123       3        11 consistent  
#> 11     371   124       2        11 inconsistent

reprex 包 (v2.0.0) 于 2021 年 6 月 15 日创建

Created on 2021-06-15 by the reprex package (v2.0.0)

这篇关于根据其他列 UserID 识别数据序列更改的情况的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆