How to optimize this "Time apart" SQL query?

Question

I would like to have an easy way to run queries like the following query on my time series data:

"What are the most typical events to take place within seven days of each other?"

I could do this with SQL and a Java program, by looking at each row and running a query that looks up all events seven days earlier or later, but this is not very elegant and performance will be horrible.

I also got help from JNK and Milen A. Radev to write the following SQL. My problem is that when I tested this on 23 million rows, it ran for two hours and stopped because my RamDisk (where I run the PostgreSQL database) was full. Do you have any idea how I could optimize such a query?

SELECT a.eventID, b.eventID, COUNT(*)
FROM table a
INNER JOIN table b
    ON a.eventID <> b.eventID
WHERE ABS(EXTRACT(EPOCH FROM (a.thetimeanddate - b.thetimeanddate))) < 5
GROUP BY a.eventID, b.eventID 
ORDER BY COUNT(*) DESC
LIMIT 1000;


Recommended answer

Part of the problem is that certain functions prevent the RDBMS from being able to infer certain properties of your query and then search any indexes. (Look up SARGABLE for more info on this.)
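
As a minimal illustration of the difference, assume a hypothetical table named events with a timestamp column thetimeanddate, and an arbitrary reference timestamp. The two filters below select the same rows, but only the second leaves the column bare on one side of each comparison, which is what allows the planner to consider an index range scan.

```sql
-- Assumptions: "events" is a hypothetical table name, and the reference
-- timestamp is arbitrary. EXTRACT(EPOCH FROM interval) yields seconds,
-- so five days is written as 5 * 86400.

-- Not sargable: the column is buried inside ABS(EXTRACT(...)), so a plain
-- index on thetimeanddate cannot be used to narrow the search.
SELECT COUNT(*)
FROM events
WHERE ABS(EXTRACT(EPOCH FROM (thetimeanddate - TIMESTAMP '2011-01-15 00:00:00'))) < 5 * 86400;

-- Sargable: the column stands alone and is compared against values that
-- can be computed up front.
SELECT COUNT(*)
FROM events
WHERE thetimeanddate > TIMESTAMP '2011-01-15 00:00:00' - INTERVAL '5 days'
  AND thetimeanddate < TIMESTAMP '2011-01-15 00:00:00' + INTERVAL '5 days';
```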

This means that the RDBMS has to process every combination of events and check the WHERE clause to see if they are within 5 days of each other. With 23 million rows, that is 23,000,000 x 23,000,000 = 529,000,000,000,000 combinations. (529 million million is quite a lot.)

If you rewrite the query to say "WHERE b.thetimeanddate has these properties", you may see a performance boost, provided you have an index covering the [thetimeanddate] field. For example...

SELECT
  a.eventID,
  b.eventID,
  COUNT(*)
FROM
  table a
INNER JOIN
  table b
    ON a.eventID <> b.eventID
WHERE
      b.thetimeanddate >= date_trunc('day', a.thetimeanddate) - INTERVAL '5 days'
  AND b.thetimeanddate <  date_trunc('day', a.thetimeanddate) + INTERVAL '6 days'
GROUP BY
  a.eventID,
  b.eventID
ORDER BY
  COUNT(*) DESC
LIMIT
  1000
;

The RDBMS should now find it much easier to make use of any index on the table that covers the [thetimeanddate] field. It now just works out the truncated dates for each of the 23 million events that you have, and checks an index to see how many appear between "this date" and "that date". Quite possibly a few million times quicker than the alternative...
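
A sketch of the index this relies on, assuming the table is called events (a stand-in for the "table" placeholder in the queries above) and using a made-up index name. EXPLAIN prints the chosen plan without running the query, so it is a cheap way to check whether the index is actually picked up on your data.

```sql
-- Assumptions: "events" and "idx_events_thetimeanddate" are hypothetical names.
CREATE INDEX idx_events_thetimeanddate
    ON events (thetimeanddate);

-- Refresh planner statistics so the new index can be costed sensibly.
ANALYZE events;

-- Show the plan only; nothing is executed. Whether an index scan on b
-- appears depends on the data and on the planner's cost estimates.
EXPLAIN
SELECT a.eventID, b.eventID
FROM events a
INNER JOIN events b
    ON a.eventID <> b.eventID
WHERE b.thetimeanddate >= date_trunc('day', a.thetimeanddate) - INTERVAL '5 days'
  AND b.thetimeanddate <  date_trunc('day', a.thetimeanddate) + INTERVAL '6 days';
```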

(I would also be tempted to move the WHERE clause into the ON clause, but only for style purposes. The performance will be identical. Remember, the RDBMS compiles these queries, and chooses algorithms and optimisations. If two queries can be algebraically manipulated into being the same, they will normally yield the same final execution plan. [Assuming all the information needed for the manipulation exists in the query, and is not 'just known in your head'.])
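
For reference, the same query with the date-range conditions moved into the ON clause might look like the sketch below (events is a hypothetical stand-in for the table name). For an INNER JOIN the result is the same either way; the difference is only in how the query reads.

```sql
-- Assumption: "events" is a hypothetical table name.
SELECT
  a.eventID,
  b.eventID,
  COUNT(*)
FROM
  events a
INNER JOIN
  events b
    ON  a.eventID <> b.eventID
    -- All join conditions now live together in the ON clause.
    AND b.thetimeanddate >= date_trunc('day', a.thetimeanddate) - INTERVAL '5 days'
    AND b.thetimeanddate <  date_trunc('day', a.thetimeanddate) + INTERVAL '6 days'
GROUP BY
  a.eventID,
  b.eventID
ORDER BY
  COUNT(*) DESC
LIMIT
  1000
;
```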

EDIT

I've also noticed that you're grouping by both a.eventID and b.eventID, then doing a count. Assuming the eventID is unique in the table, this would always yield a count of 1...
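
If the goal is to count pairs of event kinds rather than pairs of individual rows, one possible fix is to group by a column that identifies the kind of event. The sketch below assumes such a column exists and calls it eventType. That name, like the events table name, is an assumption and does not come from the question.

```sql
-- Assumptions: "events" is a hypothetical table name and "eventType" is a
-- hypothetical column holding the kind of event.
SELECT
  a.eventType AS type_a,
  b.eventType AS type_b,
  COUNT(*)    AS pair_count
FROM
  events a
INNER JOIN
  events b
    ON  a.eventID <> b.eventID   -- never pair a row with itself
    AND b.thetimeanddate >= date_trunc('day', a.thetimeanddate) - INTERVAL '5 days'
    AND b.thetimeanddate <  date_trunc('day', a.thetimeanddate) + INTERVAL '6 days'
GROUP BY
  a.eventType,
  b.eventType
ORDER BY
  pair_count DESC
LIMIT
  1000
;
```

Grouping by the kind of event means many row pairs fall into each group, so the count is no longer fixed at 1.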

EDIT

Changed +5 to + INTERVAL '5 days'
