扁平交叉时间盘 [英] Flattening intersecting timespans
问题描述
我有大量数据,具有给定ID的开始和结束时间,我需要将所有相交和相邻的时间段平坦化为一个组合的时间段。下面发布的示例数据都是相同的ID,所以我没有列出。
I have lots of data with start and stop times for a given ID and I need to flatten all intersecting and adjacent timespans into one combined timespan. The sample data posted below is all for the same ID so I didn't list it.
为了使事情更清晰,请查看03.06的示例数据.2009:
To make things a bit clearer, take a look at the sample data for 03.06.2009:
以下时间戳重叠或重复,需要合并成一次。
The following timespans are overlapping or contiunous and need to merge into one timespan
- 05:54:48 - 10:00:13
- 09:26:45 - 09:59:40
最后的时间范围是从05:54:48到10:00:13。由于在10:00:13和10:12:50之间有差距,我们也有以下时间段:
The resulting timespan would be from 05:54:48 to 10:00:13. Since there's a gap between 10:00:13 and 10:12:50 we also have the following timespans:
- 10:12:50 - 10:27:25
- 10:13:12 - 11:14:56
- 10:27:25 - 10:27:31
- 10:27:39 - 13:53:38
- 11:14:56 - 11:15:03
- 11:15:30 - 14:02:14
- 13:53:38 - 13:53:43
- 14 :02:14 - 14:02:31
- 10:12:50 - 10:27:25
- 10:13:12 - 11:14:56
- 10:27:25 - 10:27:31
- 10:27:39 - 13:53:38
- 11:14:56 - 11:15:03
- 11:15:30 - 14:02:14
- 13:53:38 - 13:53:43
- 14:02:14 - 14:02:31
导致一个合并的时间段从10:12:50到14:02: 31,因为它们是重叠的或相邻的。
which result in one merged timespan from 10:12:50 to 14:02:31, since they're overlapping or adjacent.
下面你将会找到我需要的示例数据和扁平化数据。持续时间栏仅供参考。
Below you will find the sample data and the flattened data as I would need it. The duration column is just informative.
任何解决方案 - 不管是SQL还是不支持。
Any solution - be it SQL or not - is appreciated.
编辑:由于有许多不同而有趣的解决方案,我正在通过添加约束来改进我的原始问题,以查看最佳(如果有)解决方案泡沫:
EDIT: Since there are lots of different and interesting solutions I'm refining my original question by adding constraints to see the "best" (if there is one) solution bubble up:
- 我从另一个系统通过ODBC获取数据。没有办法更改我的表格布局或添加索引
- 数据仅由日期列索引(时间部分不是)
- 每天大约有2.5k行
- 数据的估计使用模式大致如下:
<大部分时间(让我们说90%),用户只会查询一到两天(2.5k - 5k行) - 有时(9%)的范围将由一个月(〜75k行)
- 很少(1%)范围将达一年(约900k行)
- I'm getting the data via ODBC from another system. There's no way to change the table layout for me or adding indexes
- The data is indexed only by the date column (the time part isn't)
- There are about 2.5k rows for every day
- The estimated usage pattern of the data is roughly as follows:
- Most of the time (lets say 90%) the user will query just one or two days (2.5k - 5k rows)
- Sometimes (9%) the range will be up to a month (~75k rows)
- Rarely (1%) the range will be up to a year (~900k rows)
在这些约束之内,最好的解决方案是什么?恐怕大多数解决方案将会非常缓慢,因为他们加入了日期和时间的组合,这不是我的情况下的一个索引领域。
Within these constraints, what would be the best solution? I'm afraid that most of the solutions will be horribly slow since they join on the combination of date and time, which is not an index field in my case.
您在客户端或服务器端执行所有合并?您是否首先创建一个优化的临时表,并使用该表中的一个提出的解决方案?我没有时间来测试解决方案,但是现在我会告诉你最适合我的工作。
Would you do all the merging on the client or the server side? Would you first create an optimized temp table and use one of the proposed solutions with that table? I didn't have the time to test the solutions until now but I will keep you informed what works best for me.
样本数据
Date | Start | Stop -----------+----------+--------- 02.06.2009 | 05:55:28 | 09:58:27 02.06.2009 | 10:15:19 | 13:58:24 02.06.2009 | 13:58:24 | 13:58:43 03.06.2009 | 05:54:48 | 10:00:13 03.06.2009 | 09:26:45 | 09:59:40 03.06.2009 | 10:12:50 | 10:27:25 03.06.2009 | 10:13:12 | 11:14:56 03.06.2009 | 10:27:25 | 10:27:31 03.06.2009 | 10:27:39 | 13:53:38 03.06.2009 | 11:14:56 | 11:15:03 03.06.2009 | 11:15:30 | 14:02:14 03.06.2009 | 13:53:38 | 13:53:43 03.06.2009 | 14:02:14 | 14:02:31 04.06.2009 | 05:48:27 | 09:58:59 04.06.2009 | 06:00:00 | 09:59:07 04.06.2009 | 10:15:52 | 13:54:52 04.06.2009 | 10:16:01 | 13:24:20 04.06.2009 | 13:24:20 | 13:24:24 04.06.2009 | 13:24:32 | 14:00:39 04.06.2009 | 13:54:52 | 13:54:58 04.06.2009 | 14:00:39 | 14:00:49 05.06.2009 | 05:53:58 | 09:59:12 05.06.2009 | 10:16:05 | 13:59:08 05.06.2009 | 13:59:08 | 13:59:16 06.06.2009 | 06:04:00 | 10:00:00 06.06.2009 | 10:16:54 | 10:18:40 06.06.2009 | 10:18:40 | 10:18:45 06.06.2009 | 10:23:00 | 13:57:00 06.06.2009 | 10:23:48 | 13:57:54 06.06.2009 | 13:57:21 | 13:57:38 06.06.2009 | 13:57:54 | 13:57:58 07.06.2009 | 21:59:30 | 01:58:49 07.06.2009 | 22:12:16 | 01:58:39 07.06.2009 | 22:12:25 | 01:58:28 08.06.2009 | 02:10:33 | 05:56:11 08.06.2009 | 02:10:43 | 05:56:23 08.06.2009 | 02:10:49 | 05:55:59 08.06.2009 | 05:55:59 | 05:56:01 08.06.2009 | 05:56:11 | 05:56:14 08.06.2009 | 05:56:23 | 05:56:27
平整结果
Date | Start | Stop | Duration -----------+----------+----------+--------- 02.06.2009 | 05:55:28 | 09:58:27 | 04:02:59 02.06.2009 | 10:15:19 | 13:58:43 | 03:43:24 03.06.2009 | 05:54:48 | 10:00:13 | 04:05:25 03.06.2009 | 10:12:50 | 14:02:31 | 03:49:41 04.06.2009 | 05:48:27 | 09:59:07 | 04:10:40 04.06.2009 | 10:15:52 | 14:00:49 | 03:44:58 05.06.2009 | 05:53:58 | 09:59:12 | 04:05:14 05.06.2009 | 10:16:05 | 13:59:16 | 03:43:11 06.06.2009 | 06:04:00 | 10:00:00 | 03:56:00 06.06.2009 | 10:16:54 | 10:18:45 | 00:01:51 06.06.2009 | 10:23:00 | 13:57:58 | 03:34:58 07.06.2009 | 21:59:30 | 01:58:49 | 03:59:19 08.06.2009 | 02:10:33 | 05:56:27 | 03:45:54
推荐答案
。我使用DATETIME列。分开存储时间是我的看法中的一个错误,因为当时间到了午夜的时候你会有问题。你可以调整这个来处理这种情况,如果你需要的话。该解决方案还假设开始和结束时间不为NULL。再次,如果不是这样,您可以根据需要进行调整。
Here is a SQL only solution. I used DATETIME for the columns. Storing the time separate is a mistake in my opinion, as you will have problems when the times go past midnight. You can adjust this to handle that situation though if you need to. The solution also assumes that the start and end times are NOT NULL. Again, you can adjust as needed if that's not the case.
解决方案的一般要点是获取与其他任何其他内容不重叠的所有开始时间跨越,获得与任何跨度不重叠的所有结束时间,然后将两者结合在一起。
The general gist of the solution is to get all of the start times that don't overlap with any other spans, get all of the end times that don't overlap with any spans, then match the two together.
结果与您的预期结果相符,但在一种情况下,手头检查看起来像你的预期输出错误。 6日应该在2009-06-06 10:18:45.000结束。
The results match your expected results except in one case, which checking by hand looks like you have a mistake in your expected output. On the 6th there should be a span that ends at 2009-06-06 10:18:45.000.
SELECT ST.start_time, ET.end_time FROM ( SELECT T1.start_time FROM dbo.Test_Time_Spans T1 LEFT OUTER JOIN dbo.Test_Time_Spans T2 ON T2.start_time < T1.start_time AND T2.end_time >= T1.start_time WHERE T2.start_time IS NULL ) AS ST INNER JOIN ( SELECT T3.end_time FROM dbo.Test_Time_Spans T3 LEFT OUTER JOIN dbo.Test_Time_Spans T4 ON T4.end_time > T3.end_time AND T4.start_time <= T3.end_time WHERE T4.start_time IS NULL ) AS ET ON ET.end_time > ST.start_time LEFT OUTER JOIN ( SELECT T5.end_time FROM dbo.Test_Time_Spans T5 LEFT OUTER JOIN dbo.Test_Time_Spans T6 ON T6.end_time > T5.end_time AND T6.start_time <= T5.end_time WHERE T6.start_time IS NULL ) AS ET2 ON ET2.end_time > ST.start_time AND ET2.end_time < ET.end_time WHERE ET2.end_time IS NULL
这篇关于扁平交叉时间盘的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!