Finding a monotonic sequence while accounting for a sequence restart on reaching the maximum


Question

I have a data.table, say dt:

library(data.table)

name <- letters[1:22]
score <- c(42, 82, 43, 32, 47, 48, 49, 50, 54, 59, 
           76, 9, 13, 88, 91, 99, 4, 6, 8, 12, 14, 15)
class <- rep(c('c1', 'c2', 'c3'), c(7, 3, 12))
dt <- data.table(name, score, class)

It looks like this:

> dt
    name score class
 1:    a    42    c1
 2:    b    82    c1
 3:    c    43    c1
 4:    d    32    c1
 5:    e    47    c1
 6:    f    48    c1
 7:    g    49    c1
 8:    h    50    c2
 9:    i    54    c2
10:    j    59    c2
11:    k    76    c3
12:    l     9    c3
13:    m    13    c3
14:    n    88    c3
15:    o    91    c3
16:    p    99    c3
17:    q     4    c3
18:    r     6    c3
19:    s     8    c3
20:    t    12    c3
21:    u    14    c3
22:    v    15    c3

I only need the records that follow a monotonically increasing sequence of score within each class. For class c1, that means only the records with scores 42, 43, 47, 48, 49. There can be at most 3 consecutive out-of-sequence scores for a given class, so row 2 (score = 82) also counts as an out-of-sequence score.

For class c2, the records with scores 50, 54, 59.

For class c3, the records with scores 76, 88, 91, 99, 04, 06, 08, 12, 14, 15. Here the sequence reached the maximum (99) and then restarted. Scores 09 and 13 in class c3 fall outside the monotonic sequence, so they need to be removed.

I want to remove the records whose scores are out of sequence within each of the classes c1, c2, c3. There are 1 million records in total.

The final output should look like this:

> dt
    name score class
 1:    a    42    c1
 2:    c    43    c1
 3:    e    47    c1
 4:    f    48    c1
 5:    g    49    c1
 6:    h    50    c2
 7:    i    54    c2
 8:    j    59    c2
 9:    k    76    c3
10:    n    88    c3
11:    o    91    c3
12:    p    99    c3
13:    q     4    c3
14:    r     6    c3
15:    s     8    c3
16:    t    12    c3
17:    u    14    c3
18:    v    15    c3

To find the monotonic sequence I tried:

dt <- dt[, .SD[score == cummax(score)],class]

but this also removes the part of the sequence that restarts after reaching the maximum value.
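To see concretely what that filter keeps, here is a minimal base R sketch applying the same `score == cummax(score)` test to the c1 and c3 scores from the table. The helper name `keep_running_max` is made up for illustration and is not part of data.table.

```r
# Scores for classes c1 and c3, copied from the example table.
c1 <- c(42, 82, 43, 32, 47, 48, 49)
c3 <- c(76, 9, 13, 88, 91, 99, 4, 6, 8, 12, 14, 15)

# Hypothetical helper: same test as the data.table call,
# keep a score only if it equals the running maximum so far.
keep_running_max <- function(score) score[score == cummax(score)]

kept_c1 <- keep_running_max(c1)
kept_c3 <- keep_running_max(c3)

print(kept_c1)  # 42 82
print(kept_c3)  # 76 88 91 99
```

In c3 nothing after 99 can equal the running maximum again, so the whole restarted run 4, 6, ..., 15 is dropped. In c1 the out-of-sequence 82 becomes the running maximum, so the filter keeps it and drops 43 through 49 instead.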

In the actual data the sequence restarts after a maximum of 999999, though for this example I have used 99 as the maximum. How can I do this?

Recommended answer

This will mostly get you there using dplyr:

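library(dplyr)

dts <- dt %>% 
       group_by(class) %>% 
       # f = 1 marks a local peak or valley: the score changes direction at this row
       mutate(f = ifelse( (score - lead(score) > 0 & lag(score) - score < 0) | 
                          (score - lead(score) < 0 & lag(score) - score > 0) , 1, 0)) %>%
       # first and last rows of a class lack one neighbour, so f can be NA there
       mutate(f = ifelse(is.na(f), 0, f)) %>%
       # g = 2 marks a flagged row next to another flagged row (e.g. the 99 -> 4 restart)
       mutate(g = ifelse((lead(f) == 1 & f == 1) | (lag(f) == 1 & f == 1), 2, 0)) %>%
       # drop only the isolated turning points (f = 1, g = 0)
       filter(f + g != 1)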

As I said, this will mostly get you there. The problem is that you will get 19 observations (the row with name = m is retained) instead of 18. You can re-run the same step on dts to eliminate name = m. Or, if this is a subset of a larger set, you can use a for or while loop. The reason is that the lead and lag functions only check one index above and one below.
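The re-run idea can be sketched as a loop that repeats the lead/lag pass until no more rows are removed. This is only an illustration on the 22 example rows: the helper name `drop_isolated_turns` and the plain data frame `dat` are assumptions made for this sketch, and the loop has not been checked against the "at most 3 consecutive out-of-sequence scores" rule on real data.

```r
library(dplyr)

# Example data from the question, rebuilt as a plain data frame.
dat <- data.frame(
  name  = letters[1:22],
  score = c(42, 82, 43, 32, 47, 48, 49, 50, 54, 59,
            76, 9, 13, 88, 91, 99, 4, 6, 8, 12, 14, 15),
  class = rep(c("c1", "c2", "c3"), c(7, 3, 12))
)

# Hypothetical helper: one pass of the lead/lag filter from the answer.
drop_isolated_turns <- function(d) {
  d %>%
    group_by(class) %>%
    mutate(
      # f = 1 for a local peak or valley within the class
      f = ifelse((score - lead(score) > 0 & lag(score) - score < 0) |
                 (score - lead(score) < 0 & lag(score) - score > 0), 1, 0),
      f = ifelse(is.na(f), 0, f),
      # g = 2 when a flagged row sits next to another flagged row
      g = ifelse((lead(f) == 1 & f == 1) | (lag(f) == 1 & f == 1), 2, 0)
    ) %>%
    filter(f + g != 1) %>%   # remove isolated turning points only
    ungroup() %>%
    select(-f, -g)
}

# Repeat the pass until the row count stops changing.
dts <- dat
repeat {
  n_before <- nrow(dts)
  dts <- drop_isolated_turns(dts)
  if (nrow(dts) == n_before) break
}

print(nrow(dts))  # 18
```

On the example the first pass removes b, d and l, the second pass removes m, and the third pass changes nothing, so the loop stops with the 18 rows shown in the expected output. The adjacent 99 and 4 survive every pass because each is flagged next to the other.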

Another option is an old-school technique known as push-pop, but I would stay away from that.
