将带有间隔阈值的可变连续时间戳记录分组 [英] Grouping variably contiguous timestamped records with spacing threshold

查看:89
本文介绍了将带有间隔阈值的可变连续时间戳记录分组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一系列带有时间戳的间歇性GPS坐标.我正在使用PostGIS将其渲染到地图画布上.要渲染它们,需要使用PostGIS中的ST_MakeLine()聚合函数将这些点聚合为线,​​从而在地图上留下缺少GPS数据的空隙.数据不一定是从设备顺序到达的.

I have a sequence of intermittently spaced GPS coordinates with timestamps. I'm using PostGIS to render them onto a map canvas. To render them, the points need to be aggregated into lines using the ST_MakeLine() aggregate function in PostGIS, leaving gaps on the map where GPS data is missing. The data does not necessarily arrive in order from the devices.

示例序列如下:

ID | Timestamp              | Location
--------------------------------------
1  | 2013-11-12 03:31:31    | (1,2)   
3  | 2013-11-12 03:31:34    | (1,3)   
7  | 2013-11-12 03:31:37    | (1,4)  
4  | 2013-11-12 03:31:43    | (1,5)   
2  | 2013-11-12 03:31:55    | (1,6)   
16 | 2013-11-12 03:33:22    | (1,7)   
22 | 2013-11-12 03:33:28    | (1,8)   
18 | 2013-11-12 03:33:32    | (1,9)   

分组条件为:

  • 如果与上一个记录的差距是> 30秒或
  • 如果自该组中第一条记录以来的时间< 15秒在这种情况下,该点属于两个组(即,一个组以该点结束,下一个组开始)

PostGIS中的ST_MakeLine()函数将产生必要的行,问题是正确地对行进行了分组.

The ST_MakeLine() function in PostGIS will produce the necessary line, the problem is properly grouping the lines.

基于此,以上代码会产生:

Based on that, the above would produce:

Start               | End                 | ST_MakeLine(?)
----------------------------------------------------------------------------
2013-11-12 03:31:31 | 2013-11-12 03:31:43 | LINE((1,2),(1,3),(1,4),(1,5))
2013-11-12 03:31:43 | 2013-11-12 03:31:55 | LINE((1,5),(1,6))
2013-11-12 03:33:22 | 2013-11-12 03:33:32 | LINE((1,7),(1,8),(1,9))

这似乎是对大多数其他连续选择"问题所引用的孤岛和空白"问题的一种变体,但有一个扭曲,即排序不规则,因此,这些解决方案似乎不适用.

This seems to be a variation on the "island and gaps" problem referenced by most other "contiguous select" questions, but with the twist that the sequencing is not regular, and thus, those solutions don't seem to apply.

我目前正在处理SQL外部的数据以生成序列,但这会导致多次往返,如果可以的话,我希望避免这种情况.

I'm currently processing the data outside SQL to generate the sequences, but that incurs several round-trips that I'd like to avoid if I can.

SQLFiddle示例数据: http://sqlfiddle.com/#!15/1ff93/7

SQLFiddle of the sample data: http://sqlfiddle.com/#!15/1ff93/7

推荐答案

我最终采取了两部分的方法来解决这个问题:

I ended up taking a two-part approach to solve this:

  • 一个存储过程,该过程在每行后附加一个组ID"
  • 简单的汇总查询

性能明显优于在数据库外部进行(45s vs 2.8s)

Performance is significantly better than doing it external to the database (45s vs 2.8s)

因此,给定一个由以下内容创建的表:

So, given a table created by the following:

CREATE TABLE locations (
  id SERIAL PRIMARY KEY,
  ts TIMESTAMP WITHOUT TIME ZONE,
  location GEOMETRY(Point,4326)
);

以下函数将遍历表并将组ID"附加到每行:

The following function will iterate over the table and append a "group id" to each row:

CREATE FUNCTION group_locations(
  IN scan_start_time TIMESTAMP WITHOUT TIME ZONE,
  IN max_time_gap INTERVAL, 
  IN max_line_duration INTERVAL)
RETURNS TABLE(
  out_geom GEOMETRY, 
  out_ts TIMESTAMP WITHOUT TIME ZONE, 
  out_group_id INTEGER) AS
$BODY$
DECLARE
  r locations%ROWTYPE;
  gid INTEGER;
  lastts TIMESTAMP;
  startts TIMESTAMP;
BEGIN
  gid := 0;
  lastts := NULL;
  startts := NULL;

  FOR r IN 
    SELECT * FROM locations 
    WHERE ts > scan_start_time
    ORDER BY ts ASC
  LOOP
    out_ts := r.ts;
    out_geom := r.location;
    out_group_id := gid;

    IF startts IS NULL OR lastts IS NULL THEN
      startts := r.ts;
    ELSIF r.ts - lastts >= max_time_gap THEN
      -- If we've hit a space in our data, bump the group id up
      -- and remember the start time for this group
      gid := gid+1;
      out_group_id = gid;
      startts := r.ts;
    ELSIF r.ts - startts >= max_line_duration THEN
      -- First, emit the current row to end the group
      RETURN NEXT;
      -- Then, bump the group id and start time, we will
      -- re-emit the same row with a higher group_id below
      gid := gid+1;
      out_group_id := gid;
      startts := r.ts;
    END IF;
    -- Emit the current row with the group_id appended
    RETURN NEXT;
    lastts := r.ts;
  END LOOP;
  RETURN;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;

如果运行我的示例数据,则结果为:

If run over my example data, the result is:

out_ts              | out_geom | out_group_id
---------------------------------------------
2013-11-12 03:31:31 | (1,2)    | 0
2013-11-12 03:31:34 | (1,3)    | 0
2013-11-12 03:31:37 | (1,4)    | 0
2013-11-12 03:31:43 | (1,5)    | 0
2013-11-12 03:31:43 | (1,5)    | 1
2013-11-12 03:31:55 | (1,6)    | 1
2013-11-12 03:33:22 | (1,7)    | 2
2013-11-12 03:33:28 | (1,8)    | 2
2013-11-12 03:33:32 | (1,9)    | 2

然后,该过程的输出可以简单地分组和汇总:

Then, the output of this procedure can be simply grouped and aggregated:

SELECT ST_Makeline(out_geom) AS geom,MIN(out_ts) AS start,MAX(out_ts) AS finish
FROM group_locations(
       NOW() AT TIME ZONE 'UTC' - '10 days'::INTERVAL,  -- how far back to look
       '30 seconds'::INTERVAL,  -- maximum gap allowed before creating a break
       '15 seconds'::INTERVAL  -- maximum duration allowed before forcing a break
)
GROUP BY out_group_id;

该函数执行速度非常快,至少比在外部执行相同的逻辑好一个数量级.缺点是没有对结果进行索引,因此在进一步的查询中直接使用它们并不是特别有效.它的运行时间约为O(2N),第一次扫描将附加组ID,然后第二次扫描进行汇总.

The function executes quite quickly, at least an order of magnitude better than doing the same logic externally. The downside is that the results are not indexed, so directly using them in further queries is not particularly performant. It runs in about O(2N) time, the first scan to append the group ID, then the second scan to aggregate.

我的最终解决方案每隔几分钟执行一次以上操作,以刷新已完全索引的"calculated_tracks"表.

My final solution executes the above every couple of minutes to refresh a "calculated_tracks" table which is fully indexed.

这篇关于将带有间隔阈值的可变连续时间戳记录分组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆