Consolidate rows based on date ranges

Question

I'd like to combine the rows of a data frame so that the ranges described by the "start" and "end" columns cover all the values from the original data set. There may be overlapping, repeated, and nested ranges, and some ranges may be missing (NA).

Here's an example of the kind of data I'd like to collapse:

# build the columns directly so start and end are numeric from the outset
data = data.frame(
    name  = c("Roger", "Roger", "Roger", "Roger", "Roger", "Roger", "Susan"),
    start = c(1,  10, 16, 3, 20, NA, 2),
    end   = c(10, 15, 17, 6, 25, NA, 8))

The desired result would be:

name   start end
Roger  1     17
Roger  20    25
Susan  2     8

My attempt has been to expand every row into one row per value in its range. This works, but I'm not sure how to shrink the result back into ranges. Also, the full data set I'm working with has ~30 million rows and very large ranges, so this method is very slow.

library(data.table)  # for rbindlist()

pb <- txtProgressBar(min = 0, max = nrow(data), style = 3)
mylist = vector("list", nrow(data))
for (i in seq_len(nrow(data))) {
  subdata = data[i, ]
  if (is.na(subdata$start)) {
    # keep a row with a missing range as a single row
    mylist[[i]] = subdata
    mylist[[i]]$daily = NA
  } else {
    # one row per value covered by the range
    sequence = seq(subdata$start, subdata$end)
    mylist[[i]] = subdata[rep(1, length(sequence)), ]
    mylist[[i]]$daily = sequence
  }
  setTxtProgressBar(pb, i)
}
close(pb)

rbindlist(mylist)


Recommended answer

I'm guessing IRanges is much more efficient for this, but here is a data.table approach:

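library(data.table)

# remove rows with missing values (setDT converts data to a data.table by reference)
DT = na.omit(setDT(data))

# sort by name, then by start
setorder(DT, name, start)

# mark threshold for a new group: the highest end among the prior rows of the same name
DT[, high_so_far := shift(cummax(end), fill=end[1L]), by=name]

# group and summarise; a row starts a new group when it begins more than 1 past high_so_far
# note: end[.N] is the end on the group's last row; if a nested range can sort
# last within its group, max(end) is the safer summary
DT[, .(start[1L], end[.N]), by=.( name, g = cumsum(start > high_so_far + 1L) )]

# g is a running counter over the whole table, so it does not restart at 0 for each name
#     name g V1 V2
# 1: Roger 0  1 17
# 2: Roger 1 20 25
# 3: Susan 1  2  8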

How it works:

  • cummax is the cumulative maximum, so cummax(end) is the highest end so far, including the current row.
  • To take the value excluding the current row, use shift (which draws from the prior row); fill=end[1L] supplies a value for each name's first row, which has no prior row.
  • cumsum(some_condition) is a standard way of making a grouping variable: the counter goes up by one at every row where the condition is TRUE.
  • .N is the number of rows in the group determined by by=, so end[.N] is the value of end on the group's last row.
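
To make the steps above concrete, here is a small trace of the intermediate vectors for Roger's non-missing rows, sorted by start. It is only a sketch: it uses base R, with c(end[1], head(...)) standing in for data.table's shift, and the name running_max is mine, not the answer's.

```r
# Roger's non-missing rows from the question, already sorted by start
start <- c(1, 3, 10, 16, 20)
end   <- c(10, 6, 15, 17, 25)

# highest end seen so far, including the current row
running_max <- cummax(end)                        # 10 10 15 17 25

# same, excluding the current row (base-R stand-in for shift();
# the first row is filled with its own end)
high_so_far <- c(end[1], head(running_max, -1))   # 10 10 10 15 17

# the counter goes up whenever a row starts more than 1 past the reach so far
g <- cumsum(start > high_so_far + 1)              # 0 0 0 0 1

print(rbind(start, end, running_max, high_so_far, g))
```

Only the last row (start 20) begins beyond 17 + 1, so it is the only row that opens a new group.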

The columns can be named in the last step like .(s = start[1L], e = end[.N]) if desired.
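
One caveat, as far as I can tell: end[.N] is the end on the last row of each group, and after sorting by start that row can be a nested range with a smaller end. The sketch below names the output columns and swaps in max(end). Both the two-row input and the max(end) substitution are my own assumptions for illustration, not part of the answer's code.

```r
library(data.table)

# hypothetical input: the nested range (3, 6) sorts last within its group
DT <- data.table(name = "Roger", start = c(1, 3), end = c(10, 6))
setorder(DT, name, start)

# same threshold column as in the answer
DT[, high_so_far := shift(cummax(end), fill = end[1L]), by = name]

# named output columns; max(end) instead of end[.N], which would give 6 here
res <- DT[, .(s = start[1L], e = max(end)),
          by = .(name, g = cumsum(start > high_so_far + 1L))]
print(res)
```

With this input the single merged range is 1 to 10, whereas end[.N] would report 1 to 6.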

With date intervals. If working with dates, I'd suggest the IDate class; just use as.IDate to convert a Date.

We can add 1 to a date, but unfortunately cummax is not defined for dates, so...

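# cumulative maximum on the underlying integers, then restore the class;
# IDate's class vector is c("IDate", "Date"), in that order
cummax_idate = function(x) (setattr(cummax(unclass(x)), "class", c("IDate", "Date")))

set.seed(1)
d = sample(as.IDate("2011-11-11") + 1:10)
cummax_idate(d)
# output below is from R < 3.6.0; newer versions draw a different sample for the same seed
#  [1] "2011-11-14" "2011-11-15" "2011-11-16" "2011-11-18" "2011-11-18"
#  [6] "2011-11-19" "2011-11-20" "2011-11-20" "2011-11-21" "2011-11-21"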

I think this function can be used in place of cummax.

The extra () in the function are there because setattr returns its result invisibly, so the value would not print without them.
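
For completeness, here is how the helper might slot into the grouping steps when start and end are IDate columns. This is a sketch under assumptions: the dates are made up, the helper is repeated so the snippet runs on its own, and the summary uses max(end) so that a nested range sorting last in its group cannot shrink the result.

```r
library(data.table)

# cummax helper for IDate, repeated here so the snippet is self-contained
cummax_idate <- function(x) (setattr(cummax(unclass(x)), "class", c("IDate", "Date")))

# hypothetical date intervals
DT <- data.table(
  name  = c("Roger", "Roger", "Roger", "Susan"),
  start = as.IDate(c("2011-11-01", "2011-11-05", "2011-11-20", "2011-11-03")),
  end   = as.IDate(c("2011-11-10", "2011-11-08", "2011-11-25", "2011-11-04"))
)
setorder(DT, name, start)

# latest end among the prior rows of the same name
DT[, high_so_far := shift(cummax_idate(end), fill = end[1L]), by = name]

# a row opens a new group when it starts more than one day after high_so_far
res <- DT[, .(start = start[1L], end = max(end)),
          by = .(name, g = cumsum(start > high_so_far + 1L))]
print(res)
```

Roger's first two intervals overlap and collapse into 2011-11-01 to 2011-11-10; the interval starting 2011-11-20 and Susan's interval stay separate.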
