Finding gaps in huge event streams?
Problem description
I have about 1 million events in a PostgreSQL database that are of this format:
id | stream_id | timestamp
----------+-----------------+-----------------
1 | 7 | ....
2 | 8 | ....
There are about 50,000 unique streams.
I need to find all of the events where the time between any two of the events is over a certain time period. In other words, I need to find event pairs where there was no event in a certain period of time.
For example:
a b c d e f g h i j k
| | | | | | | | | | |
\____2 mins____/
In this scenario, I would want to find the pair (f, g) since those are the events immediately surrounding a gap.
I don't care if the query is (that) slow, i.e. on 1 million records it's fine if it takes an hour or so. However, the data set will keep growing, so hopefully if it's slow it scales sanely.
I also have the data in MongoDB.
What's the best way to perform this query?
Recommended answer
You can do this with the lag() window function over a partition by stream_id that is ordered by the timestamp. The lag() function gives you access to previous rows in the partition; without an explicit offset, it returns the value from the immediately preceding row. So if the partition on stream_id is ordered by time, the previous row is the previous event for that stream_id.
-- Window functions cannot be used in WHERE, so filter in an outer query.
SELECT *
FROM (
  SELECT stream_id, lag(id) OVER pair AS start_id, id AS end_id,
         ("timestamp" - lag("timestamp") OVER pair) AS diff
  FROM my_table
  WINDOW pair AS (PARTITION BY stream_id ORDER BY "timestamp")
) AS gaps
WHERE diff > interval '2 minutes';
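To make the windowing logic concrete, here is a minimal in-memory Python sketch of the computation that lag() over a partition by stream_id performs. It only illustrates the idea on a handful of rows and is not a replacement for running the query in the database. The function name find_gaps and the sample events are hypothetical, invented for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_gaps(events, threshold):
    """Return (stream_id, start_id, end_id, diff) for each gap over threshold.

    events is an iterable of (id, stream_id, timestamp) tuples.
    Hypothetical helper: mirrors PARTITION BY stream_id ORDER BY timestamp.
    """
    # Partition the events by stream_id.
    by_stream = defaultdict(list)
    for event_id, stream_id, ts in events:
        by_stream[stream_id].append((ts, event_id))

    gaps = []
    for stream_id in sorted(by_stream):
        # Order each partition by timestamp.
        rows = sorted(by_stream[stream_id])
        # Pair every row with its predecessor, as lag() does.
        for (prev_ts, prev_id), (ts, event_id) in zip(rows, rows[1:]):
            diff = ts - prev_ts
            if diff > threshold:
                gaps.append((stream_id, prev_id, event_id, diff))
    return gaps


# Made-up sample data: stream 7 has a 4.5 minute gap, stream 8 has none.
t0 = datetime(2024, 1, 1, 12, 0, 0)
sample_events = [
    (1, 7, t0),
    (2, 8, t0 + timedelta(seconds=10)),
    (3, 7, t0 + timedelta(seconds=30)),
    (4, 7, t0 + timedelta(minutes=5)),
    (5, 8, t0 + timedelta(seconds=50)),
]

gaps = find_gaps(sample_events, timedelta(minutes=2))
for gap in gaps:
    print(gap)
```

With a two-minute threshold, only the pair of events 3 and 4 in stream 7 is reported. The first event of each stream has no predecessor and so never forms a pair, which corresponds to lag() returning NULL for the first row of a partition.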