Finding a monotonic sequence while accounting for sequence restarts on reaching the maximum
Problem description
I have a data.table, say dt:
library(data.table)
name <- letters[1:22]
score <- c(42, 82, 43, 32, 47, 48, 49, 50, 54, 59,
76, 9, 13, 88, 91, 99, 4, 6, 8, 12, 14, 15)
class <- rep(c('c1', 'c2', 'c3'), c(7, 3, 12))
dt <- data.table(name, score, class)
It looks like this:
> dt
name score class
1: a 42 c1
2: b 82 c1
3: c 43 c1
4: d 32 c1
5: e 47 c1
6: f 48 c1
7: g 49 c1
8: h 50 c2
9: i 54 c2
10: j 59 c2
11: k 76 c3
12: l 9 c3
13: m 13 c3
14: n 88 c3
15: o 91 c3
16: p 99 c3
17: q 4 c3
18: r 6 c3
19: s 8 c3
20: t 12 c3
21: u 14 c3
22: v 15 c3
I only need the records that follow a monotonic (increasing) sequence of scores within each class. For class c1 that means only the records with scores 42, 43, 47, 48, 49. There can be at most 3 consecutive out-of-sequence scores for a given class, so row 2 (score = 82) is also an out-of-sequence score.
For class c2, the records with scores 50, 54, 59.
For class c3, the records with scores 76, 88, 91, 99, 04, 06, 08, 12, 14, 15. Here the sequence reached the maximum (99) and then restarted. Scores 09 and 13 in class c3 fall outside the monotonic sequence and therefore need to be removed.
I want to remove the records whose scores are out of sequence within each of the classes c1, c2, c3. There are 1 million records in total.
The final output must look like this:
> dt
name score class
1: a 42 c1
2: c 43 c1
3: e 47 c1
4: f 48 c1
5: g 49 c1
6: h 50 c2
7: i 54 c2
8: j 59 c2
9: k 76 c3
10: n 88 c3
11: o 91 c3
12: p 99 c3
13: q 4 c3
14: r 6 c3
15: s 8 c3
16: t 12 c3
17: u 14 c3
18: v 15 c3
To find the monotonic sequence I have tried:
dt <- dt[, .SD[score == cummax(score)],class]
but this also removes the part of the sequence that restarts after reaching the maximum value.
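To see why the cummax filter loses the restarted part, here is a small illustration on the class c3 scores alone. It uses only base R; the variable names c3 and keep are made up for this sketch.

```r
# Scores of class c3 from the example data
c3 <- c(76, 9, 13, 88, 91, 99, 4, 6, 8, 12, 14, 15)

# cummax() never decreases, so once 99 is reached the running maximum
# stays at 99 and no later (restarted) score can equal it
keep <- c3 == cummax(c3)

print(cummax(c3))
print(c3[keep])  # 76 88 91 99
```

Everything after the 99 is dropped, because the running maximum is never reset when the sequence wraps around.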
In the actual data the sequence restarts after reaching a maximum of 999999, though for this example I have used 99 as the maximum. How can I do this?
Recommended answer
This will mostly get you there using dplyr:
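library(dplyr)
dts <- dt %>%
group_by(class) %>%
# f = 1 marks a local peak or a local trough within the class
mutate(f = ifelse( (score - lead(score) > 0 & lag(score) - score < 0) |
(score - lead(score) < 0 & lag(score) - score > 0) , 1, 0)) %>%
# the first and last row of each class have no neighbour, so f is NA there
mutate(f = ifelse(is.na(f), 0, f)) %>%
# g = 2 marks a peak directly next to a trough, i.e. a sequence restart
mutate(g = ifelse((lead(f) == 1 & f == 1) | (lag(f) == 1 & f == 1), 2, 0)) %>%
# drop isolated peaks and troughs (f = 1, g = 0); keep everything else
filter(f + g != 1)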
As I said, this will mostly get you there. The problem is that you will get 19 observations (the row with name = m is retained) as opposed to 18. What you can do is re-run this on dts to eliminate name = m. Or, if this is a subset of a larger set, you can use for or while loops. The reason is that the lead and lag functions only check one index above and one index below.
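The re-running idea can be sketched as a loop that applies the same filter until the number of rows stops changing. This is a minimal sketch, not a general solution: the helper name drop_isolated_extremes is hypothetical, the data are rebuilt as a plain data.frame so the block is self-contained, and the heuristic still assumes that a genuine restart shows up as a peak immediately followed by a trough.

```r
library(dplyr)

# Example data from the question, as a plain data.frame
dt <- data.frame(
  name  = letters[1:22],
  score = c(42, 82, 43, 32, 47, 48, 49, 50, 54, 59,
            76, 9, 13, 88, 91, 99, 4, 6, 8, 12, 14, 15),
  class = rep(c("c1", "c2", "c3"), c(7, 3, 12))
)

# Hypothetical helper: one pass of the peak/trough filter from the answer
drop_isolated_extremes <- function(d) {
  d %>%
    group_by(class) %>%
    mutate(
      # 1 for a local peak or trough, 0 otherwise (NA at group edges -> 0)
      f = ifelse((score > lead(score) & lag(score) < score) |
                 (score < lead(score) & lag(score) > score), 1, 0),
      f = ifelse(is.na(f), 0, f),
      # 2 when a peak and a trough are adjacent (treated as a restart)
      g = ifelse((lead(f) == 1 & f == 1) | (lag(f) == 1 & f == 1), 2, 0)
    ) %>%
    filter(f + g != 1) %>%
    ungroup()
}

# Repeat the pass until no more rows are removed
result <- dt
repeat {
  n_before <- nrow(result)
  result <- drop_isolated_extremes(result)
  if (nrow(result) == n_before) break
}

print(result[, c("name", "score", "class")])
```

On the example data the first pass leaves 19 rows and the second pass removes name = m, giving the 18 rows requested. On the real data it would be worth checking how many passes are needed and whether longer runs of out-of-sequence scores are handled as intended.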
Another option is an old-school technique known as a push-pop technique, but I would stay away from that.