根据其他列 UserID 识别数据序列更改的情况 [英] Identify cases where data sequence changes based on other column UserIDs
问题描述
我正在处理一个数据框 df
,如下所示:
I am working on a data frame df
which is as below:
TUserId SUID mid_sum final_sum
115 201 2 7
115 309 1 8
115 404 1 9
209 245 2 10
209 398 2 10
209 510 2 10
209 602 1 10
371 111 2 11
371 115 1 11
371 123 3 11
371 124 2 11
1- 我的数据以 wide
格式排列,其中每一行都有一个唯一的学生 ID,显示为 SUID
.
1- My data is arranged in a wide
format, where each row has a unique student ID shown as SUID
.
2- 多个学生可以有相同的老师,因此多行的公共老师 ID 显示为 TUserId
.
2- Several students can have the same teacher and hence the common teacher ID across multiple rows shown as TUserId
.
3- 数据包括学生期中成绩和学生期末成绩.
3- The data includes student scores in mid-terms and then students' final scores.
4- 我很想知道是否有老师在期中给学生类似的分数(如 mid_sum
所示)在他们的期末考试中给出不一致的分数,如图所示通过 final_sum
.如果在数据中发现这种不一致,我想添加一列Status
,记录这个不一致
.
4- I am interested in finding out if there are any instances where a teacher who gave similar scores to their students on mid-terms as shown by mid_sum
gave inconsistent scores on their final exams as shown by final_sum
. If such inconsistency is found in data, I want to add a column Status
that records this inconsistency
.
a- 为此,我的规则是,如果 mid_sum
和 final_sum
按升序排序,就像我在这个示例数据框 df代码>.我想确定升序在这些列
mid_sum
和 final_sum
中的任何一个中中断的情况.
a- For this, my rule is that if mid_sum
and final_sum
are sorted in ascending order, as I have done in this example data frame df
. I want to identify the cases where the ascending sequence breaks in either of these columns mid_sum
and final_sum
.
b- 如果数据没有排序,可以做吗?
b- Can it be done, if the data is not sorted?
例如,对于 SUID = 309
,mid_sum
是前一个 mid_sum
的递减.所以应该标记为inconsistent
.它应该只发生在被同一位老师 TUserId
标记的学生身上,在这种情况下是 115
.
For example, for SUID = 309
, mid_sum
is a decrement from the previous mid_sum
. So it should be marked as inconsistent
. It should only happen for students who were marked by the same teacher TUserId
, which in this case is 115
.
同样,对于SUID = 602
,mid_sum
是从之前的mid_sum
递减.所以应该标记为inconsistent
.同样,它是针对同一个老师 TUserId = 209
Similarly, for SUID = 602
, mid_sum
is a decrement from the previous mid_sum
. So it should be marked as inconsistent
. Again, it is for the same teacher TUserId = 209
为了进一步详细说明,我想要这样的输出:
To elaborate further, I want an output like this:
TUserId SUID mid_sum final_sum Status
115 201 2 7 consistent
115 309 1 8 inconsistent
115 404 1 9 consistent
209 245 2 10 consistent
209 398 2 10 consistent
209 510 2 10 consistent
209 602 1 10 inconsistent
371 111 2 11 consistent
371 115 1 11 inconsistent
371 123 3 11 consistent
371 124 2 11 inconsistent
数据导入dput()
数据框的dput()
如下:
dput(df)
structure(list(
TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L),
SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L),
mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L),
final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)),
class = "data.frame", row.names = c(NA, -11L))
我在 SO 上寻找了类似的问题,发现了这个 R - 识别连续序列 但它似乎没有帮助我解决我的问题.另一个相关的帖子是确定一个数字序列的时间已经在 R 中被破坏了,但同样,它对我的情况没有帮助.
I looked for similar questions on SO and found this R - identify consecutive sequences but it does not seem to help me address my question. Another related post was Determine when a sequence of numbers has been broken in R but again, it does not help in my case.
关于如何解决这个问题的任何建议将不胜感激.
Any advice on how to solve this problem would be greatly appreciated.
谢谢!
推荐答案
也许 accumulate
函数系列就是为这些情况设计的.在这里使用 accumulate2
-
Perhaps accumulate
family of functions has been designed for these situations. Using accumulate2
here -
- 作为第一个参数,我通过
mid_sum
- 第二个参数是滞后值,即
lag(mid_sum)
,默认为除NA
之外的任何值和它可能采用的实际值.我认为0
是安全的 .init
提供任何值.我只选择了c
.- 如果第一个参数
(..2)
[..1 是累积值而不是第一个参数] 小于..3
即第二个参数,返回不一致
else一致
. - 现在由于提供了
.init
,结果将比提供的值大一个值,因此删除了它的第一个值[-1]
- As first argument I am passing through
mid_sum
- second argument is lagged value i.e.
lag(mid_sum)
with default as any value exceptNA
and actual values it may take. I am taking0
as safe .init
is provided with any value. I chosec
only.- if first argument
(..2)
[..1 is accumulated value and not first arg] is less than..3
i.e. second argument, returninconsistent
elseconsistent
. - Now since
.init
is provided the results will be one value large than provided, so stripped its first value[-1]
df <- structure(list(
TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L),
SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L),
mid_sum = c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 2L),
final_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L)),
class = "data.frame", row.names = c(NA, -11L))
library(tidyverse)
df %>%
arrange(TUserId, final_sum) %>%
group_by(TUserId) %>%
mutate(status = unlist(accumulate2(mid_sum, lag(mid_sum, default = 0), .init = 'c',
~ if(..2 < ..3) 'inconsistent' else 'consistent')[-1]))
#> # A tibble: 11 x 5
#> # Groups: TUserId [3]
#> TUserId SUID mid_sum final_sum status
#> <int> <int> <int> <int> <chr>
#> 1 115 201 2 7 consistent
#> 2 115 309 1 8 inconsistent
#> 3 115 404 1 9 consistent
#> 4 209 245 2 10 consistent
#> 5 209 398 2 10 consistent
#> 6 209 510 2 10 consistent
#> 7 209 602 1 10 inconsistent
#> 8 371 111 2 11 consistent
#> 9 371 115 1 11 inconsistent
#> 10 371 123 3 11 consistent
#> 11 371 124 2 11 inconsistent
由 reprex 包 (v2.0.0) 于 2021 年 6 月 15 日创建上>
Created on 2021-06-15 by the reprex package (v2.0.0)
这篇关于根据其他列 UserID 识别数据序列更改的情况的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!