Removing an out of sequence number from a column in data.table in R


Problem description

I have a data.table dt with three columns: nm, seqn, and obj.

> library(data.table)
> nm <- letters[1:22]
> seqn <- c(32,36, 86,45 , 47, 48, 49,
+            52, 54, 59, 
+            66, 9, 69, 74, 81, 88, 90, 91, 93, 94, 95, 97)
> obj <- rep(c('c1', 'c2', 'c3'), c(7, 3, 12))
> dt <- data.table(nm, seqn, obj)
> dt
    nm seqn obj
 1:  a   32  c1
 2:  b   36  c1
 3:  c   86  c1
 4:  d   45  c1
 5:  e   47  c1
 6:  f   48  c1
 7:  g   49  c1
 8:  h   52  c2
 9:  i   54  c2
10:  j   59  c2
11:  k   66  c3
12:  l    9  c3
13:  m   69  c3
14:  n   74  c3
15:  o   81  c3
16:  p   88  c3
17:  q   90  c3
18:  r   91  c3
19:  s   93  c3
20:  t   94  c3
21:  u   95  c3
22:  v   97  c3

I want to get a monotonic sequence of seqn for each obj group, so I want to remove the out-of-sequence numbers. For obj "c1" that is 86 (record 3), a large number within a series of otherwise small, monotonic seqn values. For obj "c3" it is 9 (record 12), a small number within a monotonic series of large values.

How can I do this with a data.table or data.frame?

Recommended answer

Here is another data.table solution, which differs from the one suggested in this comment.

The OP has asked for a monotonic sequence of seqn within each obj group. In addition, the OP has specified that a larger number must be removed when it is preceded and followed by smaller numbers, and a smaller number must be removed when it is preceded and followed by larger numbers. Although not stated explicitly, it can be concluded from the supplied data that the OP is referring to monotonically increasing sequences.

library(data.table)
DT[-DT[, .I[which(xor(
  shift(seqn) < shift(seqn, type = "lead"),
  between(seqn, shift(seqn), shift(seqn, type = "lead"))
))], by = obj]$V1]
#    nm seqn obj
# 1:  a   32  c1
# 2:  b   36  c1
# 3:  d   45  c1
# 4:  e   47  c1
# 5:  f   48  c1
# 6:  g   49  c1
# 7:  h   52  c2
# 8:  i   54  c2
# 9:  j   59  c2
#10:  k   66  c3
#11:  m   69  c3
#12:  n   74  c3
#13:  o   81  c3
#14:  p   88  c3
#15:  q   90  c3
#16:  r   91  c3
#17:  s   93  c3
#18:  t   94  c3
#19:  u   95  c3
#20:  v   97  c3
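To see why the xor() condition flags exactly the out-of-sequence rows, it can help to split it into named intermediate columns. The following is only an illustrative sketch: the names chk, prev, nxt, nb_increasing, in_between and flagged are not part of the answer, and the data is the sample from the question. A row is flagged when its two neighbours are in increasing order but the row itself does not lie between them (or the reverse); rows at a group boundary have a missing neighbour, evaluate to NA, and are dropped by which().

```r
library(data.table)

nm <- letters[1:22]
seqn <- c(32, 36, 86, 45, 47, 48, 49, 52, 54, 59,
          66, 9, 69, 74, 81, 88, 90, 91, 93, 94, 95, 97)
obj <- rep(c("c1", "c2", "c3"), c(7, 3, 12))
DT <- data.table(nm, seqn, obj)

# Work on a copy so that DT itself is not modified by reference.
chk <- copy(DT)

# Previous and next value within each obj group (NA at group boundaries).
chk[, `:=`(prev = shift(seqn), nxt = shift(seqn, type = "lead")), by = obj]

# Are the two neighbours themselves in increasing order?
chk[, nb_increasing := prev < nxt]

# Does the current value lie between its neighbours?
chk[, in_between := between(seqn, prev, nxt)]

# Flag rows where exactly one of the two conditions holds.
chk[, flagged := xor(nb_increasing, in_between)]

# Rows the one-liner above would remove.
chk[which(flagged)]
```

For the sample data this should list the rows with seqn 86 and 9, matching the rows missing from the output above.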



Data

library(data.table)
nm <- letters[1:22]
seqn <- c(32,36, 86,45 , 47, 48, 49, 52, 54, 59, 
          66, 9, 69, 74, 81, 88, 90, 91, 93, 94, 95, 97)
obj <- rep(c('c1', 'c2', 'c3'), c(7, 3, 12))
DT <- data.table(nm, seqn, obj)






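Monotonicity violated at the start and end of each sequence

The approach above can be enhanced to cover the edge cases where monotonicity is violated at the beginning or at the end of a sequence, i.e., of each obj group.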

seqn <- c(32,36, 86, 45, 47, 48, 49, 52, 54, 59, 
          66, 9, 13, 74, 81, 88, 90, 91, 93, 94, 95, 11)
(DT <- data.table(nm, seqn, obj))
#    nm seqn obj
# 1:  a   32  c1
# 2:  b   36  c1
# 3:  c   86  c1
# 4:  d   45  c1
# 5:  e   47  c1
# 6:  f   48  c1
# 7:  g   49  c1
# 8:  h   52  c2
# 9:  i   54  c2
#10:  j   59  c2
#11:  k   66  c3
#12:  l    9  c3
#13:  m   13  c3
#14:  n   74  c3
#15:  o   81  c3
#16:  p   88  c3
#17:  q   90  c3
#18:  r   91  c3
#19:  s   93  c3
#20:  t   94  c3
#21:  u   95  c3
#22:  v   11  c3

Note that DT has been changed in rows 13 and 22. Now, the first and the last elements of obj group c3 have become "outliers". The first element 66 is larger than the next two elements 9 and 13, and the last element 11 is lower than the preceding element 95. So, the monotonically increasing sequence starts with 9 and ends with 95, and the elements 66 and 11 have to be removed.

This is achieved by simply padding each sequence with a leading -Inf and a trailing +Inf. No other change to the code is required, except that the result has to be shifted back by one position so that it picks the correct row numbers:

DT[-DT[, {seqn <- c(-Inf, seqn, +Inf); .I[which(shift(xor(
  shift(seqn) < shift(seqn, type = "lead"),
  between(seqn, shift(seqn), shift(seqn, type = "lead"))
), type = "lead"))]}, by = obj]$V1]
#    nm seqn obj
# 1:  a   32  c1
# 2:  b   36  c1
# 3:  d   45  c1
# 4:  e   47  c1
# 5:  f   48  c1
# 6:  g   49  c1
# 7:  h   52  c2
# 8:  i   54  c2
# 9:  j   59  c2
#10:  l    9  c3
#11:  m   13  c3
#12:  n   74  c3
#13:  o   81  c3
#14:  p   88  c3
#15:  q   90  c3
#16:  r   91  c3
#17:  s   93  c3
#18:  t   94  c3
#19:  u   95  c3
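One caveat with negative indexing: if no row is flagged at all, the inner expression returns integer(0), and DT[-integer(0)] selects zero rows rather than all of them. A possible way around this is to wrap the padded variant in a small helper that keeps rows by position instead. This is a minimal sketch, assuming columns named seqn and obj as in the question; the function name drop_out_of_sequence and the variable drop_idx are hypothetical.

```r
library(data.table)

# Hypothetical helper: expects columns `seqn` and `obj`, as in the question.
drop_out_of_sequence <- function(DT) {
  drop_idx <- DT[, {
    # Pad so that the first and last element also have two neighbours.
    x <- c(-Inf, seqn, +Inf)
    flag <- xor(
      shift(x) < shift(x, type = "lead"),
      between(x, shift(x), shift(x, type = "lead"))
    )
    # Shift back by one to undo the offset introduced by the leading -Inf.
    .I[which(shift(flag, type = "lead"))]
  }, by = obj]$V1
  # Keep by position; this also works when nothing was flagged.
  DT[setdiff(seq_len(nrow(DT)), drop_idx)]
}

# Edge-case data from above: outliers at both ends of group c3.
nm <- letters[1:22]
seqn <- c(32, 36, 86, 45, 47, 48, 49, 52, 54, 59,
          66, 9, 13, 74, 81, 88, 90, 91, 93, 94, 95, 11)
obj <- rep(c("c1", "c2", "c3"), c(7, 3, 12))
res <- drop_out_of_sequence(data.table(nm, seqn, obj))

# Sanity check: is seqn strictly increasing within every group?
res[, .(increasing = all(diff(seqn) > 0)), by = obj]

# A table with nothing to remove is returned unchanged.
clean <- data.table(nm = c("a", "b", "c"), seqn = c(1, 2, 3), obj = "g")
drop_out_of_sequence(clean)
```

Like the original expression, this only handles isolated out-of-sequence values; runs of several consecutive outliers are not covered by the neighbour comparison.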
