"smoothing" time data - can it be done more efficiently?


Question

I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, start, end (in this sequence).

Now I want all rows with the same ID that have overlapping time spans (or a start date that is the day right after the end date of another row) to be merged together.

Merging them means that they end up in one row having the same ID, the min(start date) and the max(end date) (I hope you understand what I mean).

I have written a function for that (it is not fully tested, but it looks fine for the moment). The problem is that, as my data frame has nearly 100,000 observations, the function is very slow.

Can you help me improve my function in terms of efficiency?

Here is the function

smoothingEpisodes <- function (theData) {
    theOutput <- data.frame()

    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]

    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]

        if (curId != nextId | (curEnd + 1) < nextStart) {
            theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))

            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))

    theOutput
}

Thank you!

[edit]

test data:

    ID      START        END
1    1 2000-01-01 2000-03-31
2    1 2000-04-01 2000-05-31
3    1 2000-04-15 2000-07-31
4    1 2000-09-01 2000-10-31
5    2 2000-01-15 2000-03-31
6    2 2000-02-01 2000-03-15
7    2 2000-04-01 2000-04-15
8    3 2000-06-01 2000-06-15
9    3 2000-07-01 2000-07-15

(START and END have data type "Date", ID is a numeric)
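
For this test data, the expected merged output is:

  ID      START        END
1  1 2000-01-01 2000-07-31
2  1 2000-09-01 2000-10-31
3  2 2000-01-15 2000-04-15
4  3 2000-06-01 2000-06-15
5  3 2000-07-01 2000-07-15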

A dput of the data:

structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957, 
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"), 
    END = structure(c(11047, 11108, 11169, 11261, 11047, 11031, 
    11062, 11123, 11153), class = "Date")), .Names = c("ID", 
"START", "END"), class = "data.frame", row.names = c(NA, 9L))

Solution

The first [without really thinking too hard about what you are trying to do] optimisation I would suggest is to allocate storage for theOutput. At the moment, you are growing theOutput at each iteration of the loop. In R that is an absolute no-no! It is something you never do, unless you like woefully slow code: R has to copy the object and expand it during each iteration, and that is slow.

Looking at the code, we know that theOutput needs at most nrow(theData) rows, and 3 columns. So create that before the loop starts:

theOutput <- data.frame(matrix(ncol = 3, nrow = nrow(theData)))
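
One caveat: data.frame(matrix(...)) creates untyped (logical NA) columns, so the Date values will be coerced to plain numbers when rows are assigned. A minimal alternative sketch, preallocating typed columns (column names taken from the question's data):

n <- nrow(theData)
theOutput <- data.frame(ID    = numeric(n),
                        START = as.Date(rep(NA, n)),
                        END   = as.Date(rep(NA, n)))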

then fill in this object during the loop:

theOutput[i, ] <- data.frame("ID" = curId, "START" = curStart, "END" = curEnd)

for example.

It isn't clear what START and END are. If they are numeric, then working with a matrix rather than a data frame could also improve speed.
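
As a minimal sketch of that idea (assuming the ID/START/END columns from the question): Dates are stored internally as days since 1970-01-01, so you can carry them as plain numbers in a matrix and convert back at the end:

## numeric matrix instead of a data frame; Dates carried as day counts
m <- matrix(NA_real_, nrow = nrow(theData), ncol = 3,
            dimnames = list(NULL, c("ID", "START", "END")))
m[1, ] <- c(theData[1, "ID"],
            as.numeric(theData[1, "START"]),
            as.numeric(theData[1, "END"]))
## ... fill the remaining rows in the loop, then convert back with
## as.Date(m[, "START"], origin = "1970-01-01")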

Also, creating a data frame on each iteration is going to be slow. I haven't timed this, but you can just fill in the bits you want directly, without incurring the data.frame() call during each iteration:

theOutput[i, "ID"] <- curId
theOutput[i, "START"] <- curStart
theOutput[i, "END"] <- curEnd

The best tip I can give you, however, is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data, one large enough to give you some run-time to gather useful profiling data without waiting ages for the profiling run to complete. To profile in R, use Rprof():

Rprof(filename = "my_fun_profile.Rprof")
## run your function call here on a subset of the data
Rprof(NULL)

Then you can look at the output using

summaryRprof("my_fun_profile.Rprof")

Hadley Wickham (@hadley) has a package to make this a bit easier. It is called profr. And as Dirk reminds me in the comments, there is also Luke Tierney's proftools package.
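
For example, a quick sketch with profr (assuming the package is installed; run it on a subset large enough for the sampler to catch something):

library(profr)
p <- profr(smoothingEpisodes(testData))
head(p)  # flat summary of where the time goes
plot(p)  # call stacks over time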

Edit: as the OP provided some test data, I knocked up something quick to show the speed-up achieved just by following good loop practice:

smoothingEpisodes2 <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]
    nr <- nrow(theData)
    out1 <- integer(length = nr)
    out2 <- out3 <- numeric(length = nr)
    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]
        if (curId != nextId | (curEnd + 1) < nextStart) {
            out1[i-1] <- curId
            out2[i-1] <- curStart
            out3[i-1] <- curEnd
            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    out1[i] <- curId
    out2[i] <- curStart
    out3[i] <- curEnd
    theOutput <- data.frame(ID = out1,
                            START = as.Date(out2, origin = "1970-01-01"),
                            END = as.Date(out3, origin = "1970-01-01"))
    ## drop rows that were never filled in (ID is still 0); this form is
    ## safe even when no rows are empty
    theOutput <- theOutput[theOutput$ID != 0, ]
    theOutput
}

Using the test dataset provided in the object testData, I get:

> res1 <- smoothingEpisodes(testData)
> system.time(replicate(100, smoothingEpisodes(testData)))
   user  system elapsed 
  1.091   0.000   1.131 
> res2 <- smoothingEpisodes2(testData)
> system.time(replicate(100, smoothingEpisodes2(testData)))
   user  system elapsed 
  0.506   0.004   0.517

A 50% speed-up. Not dramatic, but simple to achieve just by not growing an object at each iteration.
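
Going further (a sketch only, not part of the timings above): the loop can be avoided entirely. Flag the rows that open a new episode with a vectorised comparison against the running maximum END within each ID, build an episode id with cumsum(), and aggregate per episode. This assumes valid intervals (END >= START), no missing dates, and the ordering by ID, START stated in the question:

smoothingEpisodes3 <- function(theData) {
    nr <- nrow(theData)
    ## running maximum of END within each ID (Dates are numeric underneath)
    runEnd <- ave(as.numeric(theData$END), theData$ID, FUN = cummax)
    ## a row opens a new episode if the ID changes or its START leaves a
    ## gap of more than one day after everything seen so far for that ID
    newEp <- c(TRUE, theData$ID[-1] != theData$ID[-nr] |
                     as.numeric(theData$START[-1]) > runEnd[-nr] + 1)
    grp <- cumsum(newEp)
    data.frame(ID    = tapply(theData$ID, grp, "[", 1),
               START = as.Date(tapply(as.numeric(theData$START), grp, min),
                               origin = "1970-01-01"),
               END   = as.Date(tapply(as.numeric(theData$END), grp, max),
                               origin = "1970-01-01"),
               row.names = NULL)
}

On the test data this reproduces the loop's result, and on large inputs vectorised grouping of this kind typically beats an R-level loop by a wide margin.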
