Joining data based on a moving time window in R


Problem description


I have weather data that was recorded every hour, and location data (X,Y) that was recorded every 4 hours. I want to know what the temperature was at location X,Y. The weather records don't fall at exactly the same times as the locations, so I wrote this loop to scan through the weather data for each location, find the "closest" record in Date/Time, and extract the data from that time.

The problem is that, the way I've written it, when it scans the weather data for location #2 it will not allow the weather record that was already assigned to location #1 to be assigned again. Say locations #1 and #2 were taken 10 minutes apart, at 6pm and 6:10pm, and the closest weather time for both is 6pm. I can't get it to allow the 6pm weather data as an option for location #2.

I set it up like this because, 200 locations into my location data set (say 3 months in), I don't want the scan to start at time 0 of the weather data when I know that the closest weather record was just calculated for the previous location and is about 3 months into that data set too. Below are some sample data and my code. I don't know if this makes sense.

Location data

X   Y   DateTime
1   2   4/2/2003 18:01:01
3   2   4/4/2003 17:01:33
2   3   4/6/2003 16:03:07
5   6   4/8/2003 15:03:08
3   7   4/10/2003 14:03:06
4   5   4/2/2003 13:02:00
4   5   4/4/2003 12:14:43
4   3   4/6/2003 11:00:56
3   5   4/8/2003 10:02:06
2   4   4/10/2003 9:02:19

Weather data

DateTime             WndSp   WndDir  Hgt
4/2/2003 17:41:00    8.17    102.86  3462.43
4/2/2003 20:00:00    6.70    106.00  17661.00
4/2/2003 10:41:00    6.18    106.00  22000.00
4/2/2003 11:41:00    5.78    106.00  22000.00
4/2/2003 12:41:00    5.48    104.00  22000.00
4/4/2003 17:53:00    7.96    104.29  6541.00
4/4/2003 20:53:00    6.60    106.00  22000.00
4/4/2003 19:41:00    7.82    105.00  7555.00
4/4/2003 7:41:00     6.62    105.00  14767.50
4/4/2003 8:41:00     6.70    106.00  17661.00
4/4/2003 9:41:00     6.60    106.00  22000.00
4/5/2003 20:41:00    7.38    106.67  11156.67
4/6/2003 18:07:00    7.82    105.00  7555.00
4/6/2003 21:53:00    6.18    106.00  22000.00
4/6/2003 21:41:00    6.62    105.00  14767.50
4/6/2003 4:41:00     7.96    104.29  6541.00
4/6/2003 5:41:00     7.82    105.00  7555.00
4/6/2003 6:41:00     7.38    106.67  11156.67
4/8/2003 18:53:00    7.38    106.67  11156.67
4/8/2003 22:53:00    5.78    106.00  22000.00
4/8/2003 1:41:00     5.78    106.00  22000.00
4/8/2003 2:41:00     5.48    104.00  22000.00
4/8/2003 3:41:00     8.17    102.86  3462.43
4/10/2003 19:53:00   6.62    105.00  14767.50
4/10/2003 23:53:00   5.48    104.00  22000.00
4/10/2003 22:41:00   6.70    106.00  17661.00
4/10/2003 23:41:00   6.60    106.00  22000.00
4/10/2003 0:41:00    6.18    106.00  22000.00
4/11/2003 17:41:00   8.17    102.86  3462.43
4/12/2003 18:41:00   7.96    104.29  6541.0


weathrow = 1
for (i in 1:nrow(SortLoc)) {
    t = 0
    while (t < 1) {
        timedif1 = difftime(SortLoc$DateTime[i], SortWeath$DateTime[weathrow], units="auto")
        timedif2 =  difftime(SortLoc$DateTime[i], SortWeath$DateTime[weathrow+1], units="auto") 
        if (timedif2 < 0) {
            if (abs(timedif1) < abs(timedif2)) {
                SortLoc$WndSp[i]=SortWeath$WndSp[weathrow]
                SortLoc$WndDir[i]=SortWeath$WndDir[weathrow]
                SortLoc$Hgt[i]=SortWeath$Hgt[weathrow]
            } else {
                SortLoc$WndSp[i]=SortWeath$WndSp[weathrow+1]
                SortLoc$WndDir[i]=SortWeath$WndDir[weathrow+1]
                SortLoc$Hgt[i]=SortWeath$Hgt[weathrow+1]
            }
            t = 1
        }
        if (abs(SortLoc$DateTime[i] - SortLoc$DateTime[i+1] < 50)) {
            weathrow=weathrow
        } else {
            weathrow = weathrow+1
            #if(weathrow = nrow(SortWeath)){t=1}
        }
    } #end while
}

Solution

You could use the findInterval function to find the nearest value:

# example data:
x <- rnorm(120000)
y <- rnorm(71000)
y <- sort(y) # second vector must be sorted
id <- findInterval(x, y, all.inside=TRUE) # position of the last y not greater than x, clamped to 1..(length(y)-1)
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1) # index of the nearest y
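To see what the two lines above do, here is a tiny hand-checkable sketch. The values of x and y are made up for illustration; two of the x values lie outside the range of y to show how all.inside=TRUE keeps y[id+1] valid at both ends.

```r
# Made-up values, small enough to check by hand.
x <- c(-5, 0.4, 2.6, 10)
y <- c(0, 1, 2, 3)            # must be sorted in increasing order

# With all.inside = TRUE the result is clamped to 1..(length(y) - 1),
# so y[id + 1] always exists, even for x below or above all of y.
id <- findInterval(x, y, all.inside = TRUE)

# Pick whichever neighbour, y[id] or y[id + 1], is closer to x.
# On an exact tie the later one (id + 1) is chosen.
id_min <- ifelse(abs(x - y[id]) < abs(x - y[id + 1]), id, id + 1)

nearest <- y[id_min]
print(data.frame(x = x, id = id, id_min = id_min, nearest = nearest))
```

Here id comes out as 1, 1, 3, 3 and the nearest values as 0, 0, 3, 3.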

In your case some as.numeric calls might be needed (this assumes DateTime is stored as POSIXct, so that as.numeric gives seconds).

# assumes SortWeath is sorted by DateTime; if not: SortWeath <- SortWeath[order(SortWeath$DateTime),]
x <- as.numeric(SortLoc$DateTime)
y <- as.numeric(SortWeath$DateTime)
id <- findInterval(x, y, all.inside=TRUE)
id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1)
SortLoc$WndSp  <- SortWeath$WndSp[id_min]
SortLoc$WndDir <- SortWeath$WndDir[id_min]
SortLoc$Hgt    <- SortWeath$Hgt[id_min]
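For a concrete check, the following sketch applies the same steps to a few rows copied from the sample data in the question. The parse_time helper, the WeathTime column, and the choice of tz = "UTC" are my own assumptions for illustration; the question does not say how DateTime is stored or which time zone it uses.

```r
# Hypothetical helper: parse the question's month/day/year timestamps.
# tz = "UTC" is an assumption; use whatever zone the data was recorded in.
parse_time <- function(s) as.POSIXct(s, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Three location rows and five weather rows from the question's sample data.
SortLoc <- data.frame(
  X = c(1, 3, 2),
  Y = c(2, 2, 3),
  DateTime = parse_time(c("4/2/2003 18:01:01", "4/4/2003 17:01:33",
                          "4/6/2003 16:03:07"))
)
SortWeath <- data.frame(
  DateTime = parse_time(c("4/2/2003 17:41:00", "4/2/2003 20:00:00",
                          "4/4/2003 17:53:00", "4/4/2003 19:41:00",
                          "4/6/2003 18:07:00")),
  WndSp  = c(8.17, 6.70, 7.96, 7.82, 7.82),
  WndDir = c(102.86, 106.00, 104.29, 105.00, 105.00),
  Hgt    = c(3462.43, 17661.00, 6541.00, 7555.00, 7555.00)
)

# findInterval needs the weather times in increasing order.
SortWeath <- SortWeath[order(SortWeath$DateTime), ]

x <- as.numeric(SortLoc$DateTime)    # seconds since the epoch
y <- as.numeric(SortWeath$DateTime)
id <- findInterval(x, y, all.inside = TRUE)
id_min <- ifelse(abs(x - y[id]) < abs(x - y[id + 1]), id, id + 1)

# Keep the matched weather time too, so the match can be inspected.
SortLoc$WeathTime <- SortWeath$DateTime[id_min]
SortLoc$WndSp     <- SortWeath$WndSp[id_min]
SortLoc$WndDir    <- SortWeath$WndDir[id_min]
SortLoc$Hgt       <- SortWeath$Hgt[id_min]
print(SortLoc)
```

The three locations are matched to the weather records at 4/2 17:41, 4/4 17:53 and 4/6 18:07. Because every location is matched independently, nothing stops two locations from getting the same weather row.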


One addition: you should never, ABSOLUTELY NEVER, assign values into a data.frame inside a for loop. Check this comparison:

N=1000
x <- numeric(N)
X <- data.frame(x=x)
require(rbenchmark)
benchmark(
    vector = {for (i in 1:N) x[i]<-1},
    data.frame = {for (i in 1:N) X$x[i]<-1}
)
#         test replications elapsed relative
# 2 data.frame          100    4.32    22.74
# 1     vector          100    0.19     1.00

The data.frame version is over 20 times slower, and the more rows it contains, the bigger the difference.

So if you change your script to first initialize the result vectors:

tmp_WndSp <- tmp_WndDir <- tmp_Hgt <- rep(NA, nrow(SortLoc))

then update the values in the loop:

tmp_WndSp[i] <- SortWeath$WndSp[weathrow+1]
# and so on...

and at the end (outside the loop) update the proper columns:

SortLoc$WndSp <- tmp_WndSp
SortLoc$WndDir <- tmp_WndDir
SortLoc$Hgt <- tmp_Hgt

It should run much faster.
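If you do want to keep the loop, here is a minimal sketch of how it could look with a preallocated result. It is not the code from the question: nearest_rows is a hypothetical helper, it assumes both time vectors are numeric and sorted in increasing order, and it stores the matched row index rather than three separate value vectors. The weather pointer only moves forward when the next record is at least as close, so a record that was just used stays available for the next location.

```r
# Hypothetical helper: for each location time, return the index of the
# nearest weather time. Both inputs must be numeric and sorted increasing.
nearest_rows <- function(loc_time, weath_time) {
  n <- length(weath_time)
  out <- integer(length(loc_time))   # preallocated plain vector, not a data.frame
  w <- 1L                            # current weather row, carried across locations
  for (i in seq_along(loc_time)) {
    # Move forward only while the next weather record is at least as close.
    while (w < n &&
           abs(weath_time[w + 1L] - loc_time[i]) <= abs(weath_time[w] - loc_time[i])) {
      w <- w + 1L
    }
    out[i] <- w   # w is not stepped past this row, so it can be reused
  }
  out
}

# Made-up times for illustration.
loc_time   <- c(10, 12, 50, 51, 200)
weath_time <- c(0, 11, 49, 100)
rows <- nearest_rows(loc_time, weath_time)
print(rows)
```

With these values rows is 2, 2, 3, 3, 4: the first two locations share weather row 2 and the next two share row 3. The columns can then be filled once, outside the loop, for example SortLoc$WndSp <- SortWeath$WndSp[rows].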
