PostgreSQL query to delete records with overlapping times while preserving the earliest?

Question

I'm trying to delete records with overlapping times, but I can't find a simple and elegant way to delete all but one of the records that overlap. This question is similar to this one, with a few differences. Our table looks something like:

╔════╤═══════════════════════════════════════╤══════════════════════════════════════╤════════╤═════════╗
║ id │ start_time                            │ end_time                             │ bar    │ baz     ║
╠════╪═══════════════════════════════════════╪══════════════════════════════════════╪════════╪═════════╣
║ 0  │ Mon, 18 Dec 2017 16:08:33 UTC +00:00  │ Mon, 18 Dec 2017 17:08:33 UTC +00:00 │ "ham"  │ "eggs"  ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 1  │ Mon, 18 Dec 2017 16:08:32 UTC +00:00  │ Mon, 18 Dec 2017 17:08:32 UTC +00:00 │ "ham"  │ "eggs"  ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 2  │ Mon, 18 Dec 2017 16:08:31 UTC +00:00  │ Mon, 18 Dec 2017 17:08:31 UTC +00:00 │ "spam" │ "bacon" ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 3  │ Mon, 18 Dec 2017 16:08:30 UTC +00:00  │ Mon, 18 Dec 2017 17:08:30 UTC +00:00 │ "ham"  │ "eggs"  ║
╚════╧═══════════════════════════════════════╧══════════════════════════════════════╧════════╧═════════╝

In the example above, all records have overlapping times, where overlapping just means that the range of time defined by a record's start_time and end_time (inclusive) covers or extends over part of another record's range. However, for this problem we are interested only in records that overlap in time and also have matching bar and baz columns (rows 0, 1, and 3 above). After finding those records we'd like to delete all but the earliest, leaving the table above with just records 2 and 3: record 2 stays because its bar and baz values match no other record, and record 3 stays because it has the earliest start and end times among the matching records.

Here's what I have so far:

  delete from foos where id in (
    select
      foo_one.id
    from
      foos foo_one
    where
      user_id = 42
      and exists (
        select
          1
        from
          foos foo_two
        where
          tsrange(foo_two.start_time::timestamp, foo_two.end_time::timestamp, '[]') &&
            tsrange(foo_one.start_time::timestamp, foo_one.end_time::timestamp, '[]')
          and
            foo_one.bar = foo_two.bar
          and
            foo_one.baz = foo_two.baz
          and
            user_id = 42
          and
            foo_one.id != foo_two.id
      )
  );

Thanks for reading!

Update: I've found a solution that works for me. I apply the window function row_number() over partitions of the table grouped by the bar and baz fields, and then add a WHERE clause to the DELETE statement that excludes the first entry in each partition (the one with the smallest id).

  delete from foos where id in (
    select id from (
      select
          foo_one.id,
          row_number() over(partition by
                              bar,
                              baz
                            order by id asc)
        from
          foos foo_one
        where
          user_id = 42
          and exists (
            select
              *
            from
              foos foo_two
            where
              tsrange(foo_two.start_time::timestamp,
                        foo_two.end_time::timestamp,
                        '[]') &&
                tsrange(foo_one.start_time::timestamp,
                        foo_one.end_time::timestamp,
                        '[]')
              and
                foo_one.id != foo_two.id
          )
    ) foos where row_number <> 1
  );
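
Note that ordering by id keeps the row with the smallest id, which in the sample table is record 0 rather than record 3. If "earliest" is meant in terms of start_time, a variant along the following lines might be closer to the stated goal. This is only a sketch: the alias names ranked and rn are made up here, and it assumes that all overlapping rows sharing the same bar and baz belong to a single cluster (two separate clusters with the same bar and baz would be ranked together).

```sql
-- Sketch: rank by start_time so the earliest row of each (bar, baz) group survives.
-- The inner EXISTS also requires matching user_id, bar and baz, as described in the question.
delete from foos
where id in (
  select id
  from (
    select
      foo_one.id,
      row_number() over (
        partition by foo_one.bar, foo_one.baz
        order by foo_one.start_time asc, foo_one.id asc
      ) as rn
    from foos foo_one
    where foo_one.user_id = 42
      and exists (
        select 1
        from foos foo_two
        where foo_two.user_id = foo_one.user_id
          and foo_two.bar = foo_one.bar
          and foo_two.baz = foo_one.baz
          and foo_two.id <> foo_one.id
          and tsrange(foo_two.start_time::timestamp, foo_two.end_time::timestamp, '[]')
              && tsrange(foo_one.start_time::timestamp, foo_one.end_time::timestamp, '[]')
      )
  ) ranked
  where rn > 1
);
```

Running the inner select on its own first, or wrapping the delete in a transaction, is a cheap way to check which rows would be removed.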

Recommended answer

First of all, a small note: you really should give more information. I understand that you probably don't want to show the real columns from your business, but hiding them makes it much harder to understand what you want.

Still, I am going to give some tips on the subject. I hope they help you and anyone else with a similar problem.

  1. You need to be clear about what you define as an overlap. It can mean different things to different people.

Consider the following events:

<--a-->
    <---- b ---->
        <---- c ---->
          <-- d -->
            <---- e ---->
    <------- f -------->
                  <--- g --->

If you define overlap the way Google's dictionary does ("extend over so as to cover partly"), then "b", "d", "e" and "f" partly overlap the "c" event. If you define overlap as one event fully covering another, then "c" overlaps "d", and "f" overlaps "b", "c" and "d".

  2. Deleting groups could be a problem. In the previous case, what should we do? Should we delete "b", "c" and "d" and keep just "f"? Should we sum their values? Take the average, maybe? This decision has to be made column by column, and the meaning of each column is very important. So I can't help you with "bar" and "baz".

So, trying to guess what you really want, I am creating a similar table of events with id, user_id, start_time, end_time and name:

create table events (
  id integer,
  user_id integer,
  start_time timestamp,
  end_time timestamp,
  name varchar(100)
);

I am adding sample values:

    -- timestamp '...' literals are standard SQL and work in both MySQL and PostgreSQL
    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 1, 1000, timestamp '2017-10-09 01:00:00', timestamp '2017-10-09 04:00:00', 'a' ),
    ( 2, 1000, timestamp '2017-10-09 03:00:00', timestamp '2017-10-09 15:00:00', 'b' ),
    ( 3, 1000, timestamp '2017-10-09 07:00:00', timestamp '2017-10-09 19:00:00', 'c' ),
    ( 4, 1000, timestamp '2017-10-09 09:00:00', timestamp '2017-10-09 17:00:00', 'd' ),
    ( 5, 1000, timestamp '2017-10-09 17:00:00', timestamp '2017-10-09 23:00:00', 'e' ),
    ( 6, 1000, timestamp '2017-10-09 02:30:00', timestamp '2017-10-09 22:00:00', 'f' ),
    ( 7, 1000, timestamp '2017-10-09 17:30:00', timestamp '2017-10-10 02:00:00', 'g' );

Now, we can play with some nice queries:

List all the events that fully cover another event:

select
  -- event name
  event_1.name as event_name,
  -- names of the events that this event fully covers (GROUP_CONCAT is MySQL)
  GROUP_CONCAT(event_2.name) as overlaps_names
from events as event_1
inner join events as event_2
on
  event_1.user_id = event_2.user_id
and
  event_1.id != event_2.id
and
(
    -- event_2 starts at or after event_1 starts
    event_2.start_time >= event_1.start_time and
    -- event_2 ends at or before event_1 ends
    event_2.end_time   <= event_1.end_time
)
group by
  event_1.name;

Result:

+------------+----------------+
| event_name | overlaps_names |
+------------+----------------+
| c          | d              |
| f          | b,d,c          |
+------------+----------------+
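
For PostgreSQL specifically, the same containment check could be sketched with range types. This is only a sketch against the events table above: it assumes inclusive bounds ('[]') and uses string_agg in place of GROUP_CONCAT.

```sql
-- PostgreSQL sketch: list, for each event, the events it fully covers.
select
  event_1.name as event_name,
  string_agg(event_2.name, ',' order by event_2.name) as overlaps_names
from events as event_1
inner join events as event_2
  on event_1.user_id = event_2.user_id
 and event_1.id <> event_2.id
 -- @> is the range "contains" operator
 and tsrange(event_1.start_time, event_1.end_time, '[]')
     @> tsrange(event_2.start_time, event_2.end_time, '[]')
group by event_1.name
order by event_1.name;
```

On the sample data this should return the same two groups as the table above, with the names inside each group sorted alphabetically.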

To detect the partial overlaps, you will need something like this:

select
  -- event name
  event_1.name as event_name,
  -- names of the events that partly overlap this event (GROUP_CONCAT is MySQL)
  GROUP_CONCAT(event_2.name) as overlaps_names
from events as event_1
inner join events as event_2
on
  event_1.user_id = event_2.user_id
and
  event_1.id != event_2.id
and
(
  (
    -- event_2 starts at or after event_1 starts
    event_2.start_time >= event_1.start_time and
    -- event_2 starts at or before event_1 ends
    event_2.start_time <= event_1.end_time
   ) or
  (
    -- event_2 ends at or after event_1 starts
    event_2.end_time >= event_1.start_time and
    -- event_2 ends at or before event_1 ends
    event_2.end_time <= event_1.end_time
   )
)
group by
  event_1.name;

Result:

+------------+----------------+
| event_name | overlaps_names |
+------------+----------------+
| a          | b,f            |
| b          | c,d,a          |
| c          | b,d,e,g        |
| d          | b,e            |
| e          | f,g,d,c        |
| f          | a,g,b,d,c,e    |
| g          | c,e,f          |
+------------+----------------+
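
Note that the query above only matches an event whose start or end falls inside event_1, so an event that completely covers event_1 is not listed (for example, "f" does not appear in the row for "b"). If that case should count as an overlap too, a common test is that each event starts no later than the other one ends. A sketch in PostgreSQL syntax, assuming inclusive bounds:

```sql
-- Sketch: any kind of overlap, including event_2 completely covering event_1.
-- In PostgreSQL the two comparisons are equivalent to
--   tsrange(event_1.start_time, event_1.end_time, '[]')
--     && tsrange(event_2.start_time, event_2.end_time, '[]')
select
  event_1.name as event_name,
  string_agg(event_2.name, ',' order by event_2.name) as overlaps_names
from events as event_1
inner join events as event_2
  on event_1.user_id = event_2.user_id
 and event_1.id <> event_2.id
 and event_2.start_time <= event_1.end_time
 and event_2.end_time   >= event_1.start_time
group by event_1.name
order by event_1.name;
```

With this test the relation is symmetric: on the sample data the row for "b" would list a,c,d,f.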

Of course, I am using "group by" to make the output easier to read. It can also be useful if you want to sum or average the overlapping rows to update the parent row before the delete. GROUP_CONCAT is a MySQL function; PostgreSQL does not have it, and the closest equivalent there is string_agg. A "standard SQL" version that you can test on either database is:

select
  -- event name
  event_1.name as event_name,
  -- name of one event that this event fully covers
  event_2.name as overlaps_name
from events as event_1
inner join events as event_2
on
  event_1.user_id = event_2.user_id
and
  event_1.id != event_2.id
and
(
    -- event_2 starts at or after event_1 starts
    event_2.start_time >= event_1.start_time and
    -- event_2 ends at or before event_1 ends
    event_2.end_time   <= event_1.end_time
);

Result:

+------------+---------------+
| event_name | overlaps_name |
+------------+---------------+
| f          | b             |
| f          | c             |
| c          | d             |
| f          | d             |
+------------+---------------+
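
If the decision is to delete the covered events and keep only the covering one (the "delete b, c and d and keep just f" option raised earlier), the pair listing above could be turned into a DELETE. A minimal PostgreSQL sketch, assuming that rule is really what you want; the tie-break on id for identical intervals is an assumption added here so that two identical events do not delete each other:

```sql
-- Sketch: delete every event that is fully covered by another event of the same user.
delete from events as child
using events as parent
where parent.user_id = child.user_id
  and parent.id <> child.id
  and child.start_time >= parent.start_time
  and child.end_time   <= parent.end_time
  and (
    -- strictly smaller interval, or identical interval with the larger id
    child.start_time > parent.start_time
    or child.end_time < parent.end_time
    or parent.id < child.id
  );
```

On the sample data this would remove "b", "c" and "d" and leave "a", "e", "f" and "g".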

If you are going to do some math, keep in mind the risk of double counting: if you add the value of "d" to "c" and then add the updated "c" to "f" together with "d", then "d" is counted twice and the value of "f" ends up wrong.

// should be
new f = old f + b + old c + d
new c = old c + d // unnecessary if you are going to delete it

// very common mistake
new c = old c + d // unnecessary but not wrong yet
new f = old f + b + new c + d = old f + b + ( old c + d ) + d // wrong!!
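
One way to avoid that mistake is to compute every parent's total from the old values in a single statement. The sketch below is PostgreSQL syntax and assumes a hypothetical numeric column named amount, which is not part of the events table defined above:

```sql
-- Hypothetical: "amount" is an assumed column, added only for this illustration.
-- The subquery reads the values as they were before the UPDATE started,
-- so every parent is increased by the old values of the events it covers.
update events as parent
set amount = parent.amount + children.total
from (
  select
    event_1.id as parent_id,
    sum(event_2.amount) as total
  from events as event_1
  inner join events as event_2
    on event_1.user_id = event_2.user_id
   and event_1.id <> event_2.id
   and event_2.start_time >= event_1.start_time
   and event_2.end_time   <= event_1.end_time
  group by event_1.id
) as children
where parent.id = children.parent_id;
```

Because the whole update is one statement, the new value of "c" is never fed into the total for "f".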

You can test all these queries and write your own against the same database online at http://sqlfiddle.com/#!9/1d2455/19. Keep in mind that it runs MySQL, not PostgreSQL, but it is still a good place to test standard SQL.
