Identify consecutive sequences based on a given variable
Question
I am literally stuck on this. df1 has the following variables:
serial = a group of people
id1 = the person from the group (e.g. serial 12, id1 1 = group 12 person 1; 12 2 = group 12 person 2, etc.)
Day = the day on which the recording started
Each day consists of an equal number of observations (e.g. 95):
day1 (Monday) = day11-day196
day2 (Tuesday) = day21-day296
day3 (Wednesday) = day31-day396
day4 (Thursday) = day41-day496
day5 (Friday) = day51-day596
day6 (Saturday) = day61-day696
day7 (Sunday) = day71-day796
Example of df1:
serial  id1  Day        day1  day2  day3  day4  day5  day6  day7
12      1    Monday     2     1     2     1     1     3     1
123     1    Tuesday    0     3     0     3     3     0     3
10      1    Wednesday  0     3     3     3     3     3     3
I would like to identify the consecutive records (those with no gap between the daily records) and the total number of records.
The starting day for consecutive recordings is given by the Day variable. For example, serial 12 is a consecutive record: recording started on Monday and there are records (at least one of the 95 variables) on every day of the week. During the week (7 x 95 variables), 11 records were made.
A non-consecutive record would be serial 123, as there are gap days on day3 and day6: recording started on Tuesday and there are gaps on Wednesday and Saturday.
Finally, I would like to record the duration of the consecutive recording.
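To make the definition above concrete, here is a minimal base R sketch of the consecutiveness check. The helper name is_consecutive is hypothetical (it is not part of the data or of any package), and the sketch assumes a row counts as consecutive when every day from its starting Day through day7 has a non-zero count.

```r
wkdays <- c("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Hypothetical helper: TRUE when there is no gap day between the
# starting day and day7 (a gap day is a day with a count of 0).
is_consecutive <- function(counts, start_day) {
  start <- match(as.character(start_day), wkdays)
  all(counts[start:length(counts)] != 0)
}

is_consecutive(c(2, 1, 2, 1, 1, 3, 1), "Monday")     # serial 12
is_consecutive(c(0, 3, 0, 3, 3, 0, 3), "Tuesday")    # serial 123
is_consecutive(c(0, 3, 3, 3, 3, 3, 3), "Wednesday")  # serial 10
```

Under this assumption, serial 12 and serial 10 are consecutive, while serial 123 is not because of the zero counts on day3 and day6.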
Sample output:
serial  id1  Duration  Occurance  Days
12      1    11        7          day1 day2 day3 day4 day5 day6 day7
123     1    12        0          0
10      1    18        5          day3 day4 day5 day6 day7
Sample data:
df1 <- structure(list(serial = c(12, 123, 10), id1 = c(1, 1, 1),
    Day = structure(1:3, .Label = c("Monday", "Tuesday", "Wednesday"),
        class = "factor"),
    day1 = c(2, 0, 0), day2 = c(1, 3, 3), day3 = c(2, 0, 3),
    day4 = c(1, 3, 3), day5 = c(1, 3, 3), day6 = c(3, 0, 3),
    day7 = c(1, 3, 3)), row.names = c(NA, 3L), class = "data.frame")
Similar post: R - Identify consecutive sequences
Recommended answer
We can use rleid from data.table to get the 'Occurance' correct:
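library(data.table)

wkdays <- c("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Process each row's day1..day7 counts together with its starting Day
out1 <- do.call(rbind, Map(function(x, y) {
    i1 <- match(y, wkdays):length(x)  # positions from the starting day to day7
    i2 <- x[i1] != 0                  # TRUE where the day has at least one record
    i3 <- all(i2)                     # consecutive only if there is no gap day
    grp1 <- rleid(i2)                 # run ids for stretches of equal values
    Days <- if (i3) tapply(names(x)[i1][i2], grp1[i2], FUN = paste, collapse = ' ') else ''
    Occurance <- if (i3) length(grp1[i2]) else 0
    data.frame(Occurance, Days)
}, asplit(df1[-(1:3)], 1), df1$Day))

# Duration is the sum of the counts across all day columns
out1$Duration <- rowSums(df1[startsWith(names(df1), 'day')])
out1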
# Occurance Days Duration
#1 7 day1 day2 day3 day4 day5 day6 day7 11
#2 0 12
#3 5 day3 day4 day5 day6 day7 18
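To show what the run ids look like, here is a small base R sketch that builds the same kind of ids with rle for serial 123 (Tuesday to Sunday). The variable names runs and run_id are illustrative only; rleid from data.table returns the same ids for this input.

```r
# Non-zero flags for serial 123 from Tuesday (day2) through Sunday (day7)
i2 <- c(3, 0, 3, 3, 0, 3) != 0

# Base R equivalent of a run-length id: one id per stretch of equal values
runs <- rle(i2)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
# 1 2 3 3 4 5
```

Because the flags switch between TRUE and FALSE, the ids keep increasing and all(i2) is FALSE, which is why the answer reports an Occurance of 0 for this row.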