Find start and end positions/indices of runs/consecutive values
Question
Problem: Given an atomic vector, find the start and end indices of runs in the vector.
Example vector with runs:
x = rev(rep(6:10, 1:5))
# [1] 10 10 10 10 10 9 9 9 9 8 8 8 7 7 6
Output from rle():
rle(x)
# Run Length Encoding
# lengths: int [1:5] 5 4 3 2 1
# values : int [1:5] 10 9 8 7 6
Desired output:
# start end
# 1 1 5
# 2 6 9
# 3 10 12
# 4 13 14
# 5 15 15
The base rle class doesn't appear to provide this functionality, but the class Rle and the function rle2 do. However, given how minor the functionality is, sticking to base R seems more sensible than installing and loading additional packages.
There are code snippets elsewhere (including on SO) that solve the slightly different problem of finding start and end indices for runs that satisfy some condition. I wanted something more general that could be performed in one line and didn't involve assigning temporary variables or values.
Answering my own question because I was frustrated by the lack of search results. I hope this helps somebody!
Recommended answer
Core logic:
# Example vector and rle object
x = rev(rep(6:10, 1:5))
rle_x = rle(x)
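# Compute the end index of each run
end = cumsum(rle_x$lengths)
# Each run starts one past the previous run's end; head(end, -1) drops the last end.
# Base R's lag() does not shift a plain vector (only dplyr::lag() does), so it is avoided here.
start = c(1, head(end, -1) + 1)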
# Display results
data.frame(start, end)
# start end
# 1 1 5
# 2 6 9
# 3 10 12
# 4 13 14
# 5 15 15
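Since the question asks for a one-liner without temporary variables, the same result can be sketched in base R with with(), using the identity start = end - lengths + 1. This is only an illustration under that identity; the name runs is a hypothetical label introduced here so the result can be inspected.

```r
# Example vector from the question
x <- rev(rep(6:10, 1:5))

# One expression, base R only: with() evaluates inside the rle object,
# so `lengths` here refers to the rle component (the run lengths).
# `runs` is a hypothetical name used only for this illustration.
runs <- with(rle(x), data.frame(start = cumsum(lengths) - lengths + 1L,
                                end   = cumsum(lengths)))
runs
```

This should print the same start/end table as shown above.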
Tidyverse/dplyr way (data frame-centric):
library(dplyr)
rle(x) %>%
  unclass() %>%
  as.data.frame() %>%
  mutate(end = cumsum(lengths),
         start = c(1, dplyr::lag(end)[-1] + 1)) %>%
  magrittr::extract(c(1,2,4,3)) # To re-order start before end for display
Because the start and end vectors are the same length as the values component of the rle object, solving the related problem of identifying endpoints for runs meeting some condition is straightforward: filter or subset the start and end vectors using the condition on the run values.
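As a rough sketch of that filtering step, the example below keeps only the runs whose value is at least 9. The threshold and the names r, keep, and runs_filtered are assumptions made for illustration and do not come from the original answer.

```r
# Example vector from the question
x <- rev(rep(6:10, 1:5))
r <- rle(x)

# Endpoints of every run (same length as r$values)
end   <- cumsum(r$lengths)
start <- end - r$lengths + 1L

# Hypothetical condition on the run values: keep runs of values >= 9
keep <- r$values >= 9

# Subset values, start, and end with the same logical index
runs_filtered <- data.frame(value = r$values[keep],
                            start = start[keep],
                            end   = end[keep])
runs_filtered
```

Because the logical index is built from the run values, any other condition (equality, %in%, a minimum run length via r$lengths) can be swapped in the same way.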