R: Split observation values by and aggregate to time intervals


Problem description

There are bird observations from various observation points (obs) over certain areas (name). The start and end times were recorded, and the time difference (diff_corr) was recalculated with a correction factor, so it is not simply the difftime of the start-end interval.

I now need to split these values into "nice" intervals (15 minutes, e.g. 10:15:00, 10:30:00, ...) and then aggregate them by area (name), so that I can plot the presence of birds in those areas per clean 15-minute interval.

To make this a little clearer: an observation might start at 10:14 and run until 10:25, so it spans the intervals 10:00-10:15 and 10:15-10:30. The value should therefore be split and assigned to those intervals in proportion to the part of the observation that falls into each.

In a more complicated case, an observation might span 3 or 4 intervals, and the value has to be split accordingly there as well.

The last step would be to aggregate all observation parts per interval and plot them.

I have been searching for a solution for several days, but only found very simplistic examples where intervals were rearranged with cut and breaks. They only produced simple frequency counts and never showed what to do with associated values.

Sample data:

structure(list(obs = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("b", 
"C2", "Dürnberg2"), class = "factor"), name = c("C2", "C2", 
"C2", "C2", "C2", "C2", "C2", "C2", "C2", "b", "981", "1627", 
"b", "b", "981", "1627", "b", "b", "b", "b"), start = structure(c(1495441500, 
1495441590, 1495441650, 1495441680, 1495447380, 1495447410, 1495447530, 
1495447560, 1495447580, 1496996580, 1496996580, 1496996580, 1496996760, 
1496996820, 1496996820, 1496996820, 1496997180, 1496997300, 1496997420, 
1496998260), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1495441590, 
1495441650, 1495441680, 1495441800, 1495447410, 1495447530, 1495447560, 
1495447580, 1495447620, 1496996760, 1496996760, 1496996760, 1496996820, 
1496997180, 1496997180, 1496997180, 1496997300, 1496997420, 1496997540, 
1496998320), class = c("POSIXct", "POSIXt"), tzone = ""), diff_corr = c(1.46739130434783, 
0.978260869565217, 0.489130434782609, 1.95652173913043, 0.489130434782609, 
1.95652173913043, 0.489130434782609, 0.326086956521739, 0.652173913043478, 
2.96703296703297, 2.96703296703297, 2.96703296703297, 0.989010989010989, 
5.93406593406593, 5.93406593406593, 5.93406593406593, 1.97802197802198, 
1.97802197802198, 1.97802197802198, 0.989010989010989)), .Names = c("obs", 
"name", "start", "end", "diff_corr"), row.names = c("1", "9", 
"7", "8", "3", "2", "4", "5", "6", "13", "13.1", "13.2", "22", 
"11", "11.1", "11.2", "12", "23", "15", "16"), class = "data.frame")

P.S. I have real difficulty naming my question properly, so any hints (not only on that) are highly appreciated.

New attempt with a small example: assigning the value to intervals in proportion to the overlap (and later summing up equal intervals)

start         end         value     new values in new 15-min-intervals
10:03:00      10:14:00    11        ---> 10:00:00 =  11
10:14:00      10:16:00     2        ---> 10:00:00 = 1 ; 10:15:00 = 1
10:00:00      10:35:00    40        ---> 10:00:00 = 40/35*15 ; 10:15:00 = 40/35*15 ; 10:30:00 = 40/35*5
10:15:00      10:30:00    12        ---> 10:15:00 = 12
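
The proportional allocation in this small table can be sketched in base R as follows. This is only an illustration under stated assumptions, not part of the original post: the helper name split_to_bins, the date 2017-06-09 and the UTC time zone are made up for the example, and each observation is assumed to have end > start.

```r
# Hypothetical helper (not from the original post): split one observation's
# value across fixed-width bins in proportion to the overlap with each bin.
split_to_bins <- function(start, end, value, width = 15 * 60) {
  s <- as.numeric(start)  # seconds since the epoch
  e <- as.numeric(end)
  # Starts of all bins that the observation could touch
  bin_start <- seq(floor(s / width) * width, floor(e / width) * width, by = width)
  # Seconds of the observation that fall into each bin
  overlap <- pmin(e, bin_start + width) - pmax(s, bin_start)
  keep <- overlap > 0
  data.frame(bin = bin_start[keep], value = value * overlap[keep] / (e - s))
}

# The four rows of the small example; date and time zone are assumptions
obs <- data.frame(
  start = as.POSIXct(c("2017-06-09 10:03:00", "2017-06-09 10:14:00",
                       "2017-06-09 10:00:00", "2017-06-09 10:15:00"), tz = "UTC"),
  end   = as.POSIXct(c("2017-06-09 10:14:00", "2017-06-09 10:16:00",
                       "2017-06-09 10:35:00", "2017-06-09 10:30:00"), tz = "UTC"),
  value = c(11, 2, 40, 12)
)

# Split every observation, then stack the parts
parts <- do.call(rbind, lapply(seq_len(nrow(obs)), function(i) {
  split_to_bins(obs$start[i], obs$end[i], obs$value[i])
}))
parts$bin <- as.POSIXct(parts$bin, origin = "1970-01-01", tz = "UTC")

# Sum the parts per 15-minute bin (named by bin start, "HH:MM")
totals <- vapply(split(parts$value, format(parts$bin, "%H:%M")), sum, numeric(1))
print(totals)
```

Because every value is only redistributed, the bin totals add up to the sum of the original values. With the real data, the same idea would be applied per name before summing.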

Recommended answer

Here's a data.table approach, which lets you use SQL-like queries to sort and filter data and perform operations.

Data

> p
    obs name               start                 end diff_corr
 1:  C2   C2 2017-05-22 04:25:00 2017-05-22 04:26:30 1.4673913
 2:  C2   C2 2017-05-22 04:26:30 2017-05-22 04:27:30 0.9782609
 3:  C2   C2 2017-05-22 04:27:30 2017-05-22 04:28:00 0.4891304
 4:  C2   C2 2017-05-22 04:28:00 2017-05-22 04:30:00 1.9565217
 5:  C2   C2 2017-05-22 06:03:00 2017-05-22 06:03:30 0.4891304
 6:  C2   C2 2017-05-22 06:03:30 2017-05-22 06:05:30 1.9565217
 7:  C2   C2 2017-05-22 06:05:30 2017-05-22 06:06:00 0.4891304
 8:  C2   C2 2017-05-22 06:06:00 2017-05-22 06:06:20 0.3260870
 9:  C2   C2 2017-05-22 06:06:20 2017-05-22 06:07:00 0.6521739
10:   b    b 2017-06-09 04:23:00 2017-06-09 04:26:00 2.9670330
11:   b  981 2017-06-09 04:23:00 2017-06-09 04:26:00 2.9670330
12:   b 1627 2017-06-09 04:23:00 2017-06-09 04:26:00 2.9670330
13:   b    b 2017-06-09 04:26:00 2017-06-09 04:27:00 0.9890110
14:   b    b 2017-06-09 04:27:00 2017-06-09 04:33:00 5.9340659
15:   b  981 2017-06-09 04:27:00 2017-06-09 04:33:00 5.9340659
16:   b 1627 2017-06-09 04:27:00 2017-06-09 04:33:00 5.9340659
17:   b    b 2017-06-09 04:33:00 2017-06-09 04:35:00 1.9780220
18:   b    b 2017-06-09 04:35:00 2017-06-09 04:37:00 1.9780220
19:   b    b 2017-06-09 04:37:00 2017-06-09 04:39:00 1.9780220
20:   b    b 2017-06-09 04:51:00 2017-06-09 04:52:00 0.9890110

Code

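library(data.table)
library(lubridate)
# Convert the data frame to a data.table
p <- as.data.table(p)
# Mean of diff_corr per group; round_date() rounds start to the nearest quarter hour
p[, .(new_diff = mean(diff_corr)), .(tme_start = round_date(start, unit = "15min"))]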

Output

> p[, .(new_diff = mean(diff_corr)), .(tme_start = round_date(start, unit = "15min"))]
             tme_start  new_diff
1: 2017-05-22 04:30:00 1.2228261
2: 2017-05-22 06:00:00 0.7826087
3: 2017-06-09 04:30:00 3.3626374
4: 2017-06-09 04:45:00 0.9890110

What is data.table doing?

Since you aren't familiar with data.table, here's a very quick, elementary description of what is happening. The general form of a data.table call is:

DT[select rows, perform operations, group by] 

Here DT is the name of the data.table. Select rows is a logical operation: if you want only the observations for C2 (name), the call is DT[name == "C2",]. No operation is performed and no grouping is applied. If you want the sum of the diff_corr column for all rows with name == "C2", the call becomes DT[name == "C2", list(sum(diff_corr))]. Instead of writing list() you can use .(). The output now has only one row and one column, called V1, which is the sum of diff_corr over all rows where name == "C2". That column name is not very informative, so we assign a name (it can be the same as the old one): DT[name == "C2", .(diff_corr_sum = sum(diff_corr))].

Suppose you had another column called "mood", which records the mood of the person making the observation and can take three values ("happy", "sad", "sleepy"). You could group by mood: DT[name == "C2", .(diff_corr_new = sum(diff_corr)), by = .(mood)]. The output would have three rows, one per mood, with a mood column and a diff_corr_new column. To understand this better, try playing around with a sample dataset like mtcars. Your sample data isn't complex enough to let you explore all of these features.
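
As a rough illustration of the three slots, here is the same pattern on mtcars, which ships with R. The columns cyl, gear and mpg stand in for name, mood and diff_corr; that mapping is only an assumption made for this sketch.

```r
library(data.table)

# mtcars ships with R; work on a data.table copy of it
DT <- as.data.table(mtcars)

# Rows only: keep the 4-cylinder cars, no operation, no grouping
four_cyl <- DT[cyl == 4]

# Rows + operation: a single row with one named column
mpg_sum <- DT[cyl == 4, .(mpg_sum = sum(mpg))]

# Rows + operation + group by: one row per gear value among 4-cylinder cars
mpg_by_gear <- DT[cyl == 4, .(mpg_mean = mean(mpg)), by = .(gear)]

print(mpg_sum)
print(mpg_by_gear)
```

The grouped result carries the grouping column (gear) alongside the computed column, in the same way the mood example above would.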

Back to the answer: other versions

It's not clear from the question or comments whether you want to round based on start or end. I used the former, but you can change that. The example above uses mean, but you can perform any other operation you need. The other columns seem more or less redundant since they are strings and you can't do much with them, but you can use them to group the results further in the by entry (the last field in the code). Below are two examples using obs and name respectively. You can also combine all of them.

> p[, .(new_diff = mean(diff_corr)), .(tme_start = round_date(start, unit = "15min"), obs)]
             tme_start obs  new_diff
1: 2017-05-22 04:30:00  C2 1.2228261
2: 2017-05-22 06:00:00  C2 0.7826087
3: 2017-06-09 04:30:00   b 3.3626374
4: 2017-06-09 04:45:00   b 0.9890110


> p[, .(new_diff = mean(diff_corr)), .(tme_start = round_date(start, unit = "15min"), name)]
             tme_start name  new_diff
1: 2017-05-22 04:30:00   C2 1.2228261
2: 2017-05-22 06:00:00   C2 0.7826087
3: 2017-06-09 04:30:00    b 2.6373626
4: 2017-06-09 04:30:00  981 4.4505495
5: 2017-06-09 04:30:00 1627 4.4505495
6: 2017-06-09 04:45:00    b 0.9890110
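
A combined grouping could look like the sketch below. It uses a few rows copied from the printed data, and the UTC time zone is an assumption. It also swaps in two alternatives purely for illustration: floor_date() assigns each observation to the interval it starts in, whereas round_date() picks the nearest boundary, and sum() replaces mean() as an example of changing the operation.

```r
library(data.table)
library(lubridate)

# A few rows copied from the printed data; the UTC time zone is an assumption
p_small <- data.table(
  obs   = c("C2", "C2", "b", "b"),
  name  = c("C2", "C2", "b", "981"),
  start = as.POSIXct(c("2017-05-22 04:25:00", "2017-05-22 04:28:00",
                       "2017-06-09 04:23:00", "2017-06-09 04:23:00"), tz = "UTC"),
  diff_corr = c(1.4673913, 1.9565217, 2.9670330, 2.9670330)
)

# Group by interval start, obs and name at once.
# floor_date() maps 04:25 to 04:15; round_date() would map it to 04:30.
res <- p_small[, .(new_diff = sum(diff_corr)),
               by = .(tme_start = floor_date(start, unit = "15 minutes"), obs, name)]
print(res)
```

Neither variant splits a value across intervals; each observation is counted entirely in the interval derived from its start time.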
