Group rows based on a condition in a time series, ignoring false values

Question

I have a set of animal locations with different sampling intervals. What I want to do is group and label the sequences where the sampling interval matches a certain criterion (e.g. is below a certain value). This is a revision of this question, which was marked as a duplicate of this one. The difference in this revised question is that all values that do NOT match the criterion should be ignored, not labeled.

Let me illustrate with some dummy data:

start <- Sys.time()
timediff <- c(rep(5,3),rep(20,3),rep(5,2))
timediff <- cumsum(timediff)

# Set up a dataframe with a couple of time values
df <- data.frame(TimeDate = start + timediff)

# For understanding purposes, I will note the time differences (in seconds) in a separate column
df$TimeDiff <- c(as.numeric(diff(df$TimeDate), units = "secs"), NA)

Using @Josh O'Brien's answer, one could define a function that groups values which meet a specific criterion.

number.groups <- function(input){
  input[is.na(input)] <- FALSE # to eliminate NA
  return(head(cumsum(c(TRUE,!input)),-1))
}

# Define the criteria and apply the function
df$Group <- number.groups(df$TimeDiff <= 5)

# output
             TimeDate TimeDiff Group
1 2016-03-16 15:41:51        5     1
2 2016-03-16 15:41:56        5     1
3 2016-03-16 15:42:01       20     1
4 2016-03-16 15:42:21       20     2
5 2016-03-16 15:42:41       20     3
6 2016-03-16 15:43:01        5     4
7 2016-03-16 15:43:06        5     4
8 2016-03-16 15:43:11       NA     4

The issue here is that rows 4 and 5 are each labeled as a group of their own, rather than ignored. Is there a way to make sure that values that DO NOT belong to a group are NOT grouped (e.g. stay NA)?
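
To see where the single-row groups come from, the intermediate values of the `cumsum()` approach can be traced by hand. This is only an illustrative sketch: the `input` vector is the condition `TimeDiff <= 5` from the dummy data, typed out with the trailing NA already replaced by FALSE, and the names `starts` and `group` are chosen here for illustration.

```r
# Condition TimeDiff <= 5 for the eight dummy rows (NA already set to FALSE)
input <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)

# A new group starts at the first row and after every row that fails the condition
starts <- c(TRUE, !input)

# Running count of group starts, dropping the extra trailing element
group <- head(cumsum(starts), -1)
print(group)

# Count rows per group: any group with a single row is an isolated value
sizes <- table(group)
print(sizes)
```

Because every FALSE opens a new group on the following row, consecutive FALSE values each end up in a group of size one, which is exactly what happens to rows 4 and 5.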

Recommended answer

I think I've found a way to solve the problem. The approach is to compare each value with the next and use this information to eliminate unique values, i.e. group numbers that occur only once. Then, rename the remaining values by turning them into factors.

number.groups <- function(input){
  # Replace NAs with FALSE so that cumsum() works
  input[is.na(input)] <- FALSE
  # Build preliminary groups with cumsum()
  group <- head(cumsum(c(TRUE, !input)), -1)
  # Compare each value with the next
  compare <- head(group, -1) == tail(group, -1)
  # Flag values that match neither neighbour (groups of size one)
  uniques <- !(c(compare, FALSE) | c(FALSE, compare))
  # Set those values to NA
  group[uniques] <- NA
  # Convert to a factor
  group <- as.factor(group)
  # Renumber the remaining levels consecutively
  levels(group) <- seq_along(levels(group))
  return(group)
}

# apply the function
df$Group <- number.groups(df$TimeDiff <= 5)

# output
             TimeDate TimeDiff Group
1 2016-03-17 15:44:28        5     1
2 2016-03-17 15:44:33        5     1
3 2016-03-17 15:44:38       20     1
4 2016-03-17 15:44:58       20  <NA>
5 2016-03-17 15:45:18       20  <NA>
6 2016-03-17 15:45:38        5     2
7 2016-03-17 15:45:43        5     2
8 2016-03-17 15:45:48       NA     2
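
As an alternative sketch (not part of the original answer), the same result can probably be reached with base R's `rle()` and `inverse.rle()`: runs of length one in the preliminary grouping are the isolated values, so they can be set to NA and the remaining runs renumbered. The function name `number.groups.rle` is hypothetical, and unlike the function above it returns an integer vector rather than a factor.

```r
# Hypothetical alternative to number.groups(), based on run-length encoding
number.groups.rle <- function(input) {
  # Replace NAs with FALSE so that cumsum() works
  input[is.na(input)] <- FALSE
  # Preliminary groups, exactly as in the cumsum() approach
  group <- head(cumsum(c(TRUE, !input)), -1)
  # Encode the grouping as runs of identical values
  runs <- rle(group)
  # Runs of length one are isolated values and should not be labeled
  keep <- runs$lengths > 1
  # Number the kept runs consecutively and set the others to NA
  runs$values <- ifelse(keep, cumsum(keep), NA_integer_)
  # Expand the runs back to one value per row
  inverse.rle(runs)
}

# Same condition as df$TimeDiff <= 5 in the dummy data
result <- number.groups.rle(c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, NA))
print(result)
```

If a factor is needed, as in the answer above, the result can be wrapped in `as.factor()`.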
