SQL query to group by maximal sets of a column having inner consecutive distances below a threshold

Problem description

An added requirement complicates the problem; see below.

I have a table with two columns, time and location:

╔════════╦══════════╗
║ time   ║ location ║
╠════════╬══════════╣
║ 9:10   ║ 1        ║
║ 9:20   ║ 3        ║
║ 9:35   ║ 3        ║
║ 9:40   ║ 3        ║
║ 10:10  ║ 2        ║
║ 10:20  ║ 2        ║
║ 10:40  ║ 2        ║
║ 11:30  ║ 7        ║
║ 11:50  ║ 9        ║
╚════════╩══════════╝

I want to turn it into a new table depicting 'trips', where a trip is ended if there is no activity for an hour (= there is no other row with time that is within the next hour)

That is, there will be a row for each trip with some values calculated (total walking distance, total trip time, etc.)

For example:

╔═════════╦════════════════════════╗
║ trip_id ║ total_walking_distance ║
╠═════════╬════════════════════════╣
║ 1       ║ 3 (=|1-3|+|3-2|)       ║
║ 2       ║ 7 (=|2-7|+|7-9|)       ║
╚═════════╩════════════════════════╝

So I want to 'group by time', but not by distinct values of time but by a more complicated condition. Is it possible without resorting to procedural languages?

Sorting by time seems to make the problem easier, because we only need to know when consecutive rows are more than an hour apart, but after sorting it's still not clear how to make this into a query.

In a programming language it would be enough to sort by time and then go sequentially over the rows: when the time difference exceeds one hour, we look from the start of the current trip up to the current row, do our calculations, and reinitialize the start to the next row. This is not possible, as far as I know, in SQL.

If some time-consecutive rows have the same location, I want to only count them once: if one stays in the same place for many consecutive rows only the first one should be in the result.

This was done by first adding the lag() columns of time and location, and removing time-consecutive rows of the same location or the same time, and then partitioning naively.

(Similar to this query:)

-- Step 1: drop time-consecutive rows that repeat the previous location or time.
-- id is selected as well because the second query partitions by it.
select id, time, loc
into cleaned_from_duplicate_time_loc
from (select info.*,
             time - lag(time) over (partition by id order by time) as diff_time,
             loc - lag(loc) over (partition by id order by time)   as diff_loc
      from info
     ) with_consecutive_differences_of_location_and_time
where (diff_loc is null or diff_loc <> 0)
  and (diff_time is null or diff_time > interval '0 hour');

-- Step 2: flag the first row of each trip and number trips with a running sum.
select with_trip_start_boolean.*,
       sum(is_start_of_trip) over (order by id, time) as trip_id
into with_trip_id
from (select auxiliary_table_with_lag_diffs.*,
             case when diff_time > interval '1 hour' or diff_time is null
                  then 1 else 0 end as is_start_of_trip
      from (-- adding time_diffs for each id separately
            select cleaned_from_duplicate_time_loc.*,
                   time - lag(time) over (partition by id order by time) as diff_time
            from cleaned_from_duplicate_time_loc
           ) auxiliary_table_with_lag_diffs
     ) with_trip_start_boolean;

However, there is a complication that makes it wrong: if one stays in the same location for some time and then continues to move and there is less than an hour from the last row in which he was stationary to the next row, then we should count the stationary row twice in the result, both as the end of the previous trip and as the start of the next trip. That makes it impossible to do the cleaning-first approach altogether, because we lose the necessary information of the last 'duplicated' row in case it is less than an hour away from its successor.

How should I correctly handle the case where duplicate successive locations both end one trip and start the next?

Recommended answer

Basically, what you want is the lag() function with some date arithmetic. Specifically, you want a cumulative sum of the condition that a new trip starts:

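select row_number() over (order by min(time)) as trip_num,
       min(time) as trip_start, max(time) as trip_end,
       count(*) as num_stops,
       count(distinct location) as num_locations
from (select t.*,  -- carry time and location through for the outer aggregates
             -- running count of trip starts; the first row has no prev_time and stays in group 0
             sum(case when time > prev_time + interval '1 hour' then 1 else 0 end) over
                 (order by time) as grp
      from (select t.*,
                   lag(time) over (order by time) as prev_time
            from info t  -- table name taken from the question; substitute your own
           ) t
     ) t
group by grp;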
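
The same grouping can be extended to the totals the question asks for. What follows is a minimal sketch in PostgreSQL syntax, not part of the original answer. The table trips_demo, its rows, and the names step and num_distinct_stops are hypothetical and exist only for illustration. The idea is to assign trips first, on the raw rows, and only then deal with repeated locations inside each trip, so no row is discarded before the one-hour gaps are evaluated.

```sql
-- Hypothetical demo data (not from the original post); one gap exceeds an hour.
create temporary table trips_demo (time timestamp, location int);
insert into trips_demo values
  ('2024-01-01 09:10', 1),
  ('2024-01-01 09:20', 3),
  ('2024-01-01 09:40', 3),  -- stationary: same location as the previous row
  ('2024-01-01 10:10', 2),
  ('2024-01-01 12:30', 7),  -- more than one hour after 10:10, so a new trip starts
  ('2024-01-01 12:50', 9);

with flagged as (
  -- 1 when the row starts a new trip, 0 otherwise (the first row has no lag and gets 0)
  select t.*,
         case when time > lag(time) over (order by time) + interval '1 hour'
              then 1 else 0 end as is_start_of_trip
  from trips_demo t
),
grouped as (
  -- running sum of the start flags gives one group number per trip
  select f.*, sum(is_start_of_trip) over (order by time) as grp
  from flagged f
),
stepped as (
  -- distance walked since the previous row of the same trip;
  -- null on the first row of a trip, 0 when the location did not change
  select g.*,
         abs(location - lag(location) over (partition by grp order by time)) as step
  from grouped g
)
select row_number() over (order by min(time)) as trip_id,
       min(time) as trip_start,
       max(time) as trip_end,
       coalesce(sum(step), 0) as total_walking_distance,
       -- first row of each trip plus every row where the location changed
       count(*) filter (where step is distinct from 0) as num_distinct_stops
from stepped
group by grp
order by trip_id;
```

With these rows the query should report two trips: the first with a walking distance of 3 (|1-3|+|3-3|+|3-2|) and three distinct stops, the second with a distance of 2 and two stops. Stationary rows contribute a step of 0, so they need no separate cleaning pass for the distance. Whether a trip should also be charged for the move from the previous trip's last location is a design decision this sketch does not make; dropping partition by grp from the lag over location would be one way to try that.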
