Sequence length encoding using R
Question
Is there a way to encode increasing integer sequences in R, analogous to encoding run lengths using run length encoding (rle)?
I'll illustrate with an example:
Analogy: run length encoding
r <- c(rep(1, 4), 2, 3, 4, rep(5, 5))
rle(r)
Run Length Encoding
lengths: int [1:5] 4 1 1 1 5
values : num [1:5] 1 2 3 4 5
Desired: sequence length encoding
s <- c(1:4, rep(5, 4), 6:9)
s
[1] 1 2 3 4 5 5 5 5 6 7 8 9
somefunction(s)
Sequence lengths
lengths: int [1:4] 5 1 1 5
value1 : num [1:4] 1 5 5 5
Edit 1
Thus, somefunction(1:10) would give the result:
Sequence lengths
lengths: int [1:1] 10
value1 : num [1:1] 1
This result means that there is an integer sequence of length 10 with a starting value of 1, i.e. seq(1, 10).
Note that there isn't a mistake in my example result. The vector in fact ends in the sequence 5:9, not the 6:9 that was used to construct it.
My use case is that I am working with survey data in an SPSS export file. Each subquestion in a grid of questions will have a name of the pattern paste("q", 1:5), but sometimes there is an "other" category, which will be marked q_99, q_other or something else. I wish to find a way of identifying the sequences.
Edit 2
In a way, my desired function is the inverse of the base function sequence, with the start value (value1 in my example) added.
lengths <- c(5, 1, 1, 5)
value1 <- c(1, 5, 5, 5)
s
[1] 1 2 3 4 5 5 5 5 6 7 8 9
sequence(lengths) + rep(value1-1, lengths)
[1] 1 2 3 4 5 5 5 5 6 7 8 9
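The reconstruction above can be wrapped in a small helper, which is handy for checking any candidate encoder by round trip. This is only a minimal sketch: the name decode_sequences is hypothetical and not part of base R.

```r
# Hypothetical helper: rebuild a vector from sequence lengths and start values.
decode_sequences <- function(lengths, value1) {
  # sequence() yields 1..n for each n in `lengths`; shifting each run by
  # (start - 1) turns it into start..(start + n - 1).
  sequence(lengths) + rep(value1 - 1, lengths)
}

s <- c(1:4, rep(5, 4), 6:9)
decoded <- decode_sequences(c(5, 1, 1, 5), c(1, 5, 5, 5))
all(decoded == s)  # TRUE
```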
Edit 3
I should have stated that, for my purposes, a sequence is defined as a run of consecutive integers increasing by 1, as opposed to any monotonically increasing sequence: e.g. c(4,5,6,7), but not c(2,4,6,8) nor c(5,4,3,2,1). However, any other integer can appear between sequences.
This means a solution should be able to cope with this test case:
somefunction(c(2, 4, 1:4, 5, 5))
Sequence lengths
lengths: int [1:4] 1 1 5 1
value1 : num [1:4] 2 4 1 5
In the ideal case, the solution can also cope with the use case suggested originally, which would include characters in the vector, e.g.
somefunction(c(2, 4, 1:4, 5, "other"))
Sequence lengths
lengths: int [1:5] 1 1 5 1 1
value1 : num [1:5] 2 4 1 5 "other"
Recommended answer
EDIT: added a check so that character vectors are handled as well.
Based on rle, I came to the following solution:
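somefunction <- function(x){
    # Coerce character input to numeric; non-numeric strings become NA (with a warning)
    if(!is.numeric(x)) x <- as.numeric(x)
    n <- length(x)
    # TRUE wherever the next element is not the previous element plus 1
    y <- x[-1L] != x[-n] + 1L
    # Last index of each sequence; an NA comparison also ends a sequence
    i <- c(which(y|is.na(y)),n)
    list(
        lengths = diff(c(0L,i)),
        values = x[head(c(0L,i)+1L,-1L)]   # first element of each sequence
    )
}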
> s <- c(2,4,1:4, rep(5, 4), 6:9,4,4,4)
> somefunction(s)
$lengths
[1] 1 1 5 1 1 5 1 1 1
$values
[1] 2 4 1 5 5 5 4 4 4
This one works on every test case I tried and uses vectorized operations without ifelse clauses, so it should run faster. It converts strings to NA, so you keep a numeric output.
> S <- c(4,2,1:5,5, "other" , "other",4:6,2)
> somefunction(S)
$lengths
[1] 1 1 5 1 1 1 3 1
$values
[1] 4 2 1 5 NA NA 4 2
Warning message:
In somefunction(S) : NAs introduced by coercion
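If the original labels such as "other" should survive in the result instead of becoming NA, one possible variant is sketched below. It rests on assumptions: the name encode_sequences is hypothetical, and the start values are returned as a character vector, since a single R vector cannot mix numbers and strings.

```r
# Hypothetical variant of somefunction() that keeps non-numeric labels.
encode_sequences <- function(x) {
  labels <- as.character(x)                    # original entries, as text
  num <- suppressWarnings(as.numeric(labels))  # non-numeric entries become NA
  n <- length(num)
  if (n == 0L) return(list(lengths = integer(0), values = character(0)))
  # A sequence breaks wherever the next value is not previous + 1,
  # or wherever the comparison involves an NA.
  brk <- num[-1L] != num[-n] + 1
  ends <- c(which(brk | is.na(brk)), n)        # last index of each sequence
  starts <- c(1L, head(ends, -1L) + 1L)        # first index of each sequence
  list(lengths = ends - starts + 1L, values = labels[starts])
}

res <- encode_sequences(c(4, 2, 1:5, 5, "other", "other", 4:6, 2))
res$lengths
res$values
```

The lengths come out the same as with somefunction(S) above; only the start values differ, with "other" kept in place of NA. Suppressing the coercion warning is a design choice here, made on the assumption that non-numeric entries are expected input rather than an error.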