Calculate number of concurrent events in SQL
I have a table that holds phone calls, with the following fields:
- ID
- STARTTIME
- ENDTIME
- STATUS
- CALL_FROM
- CALL_TO
There are 2.9 million records loaded into a local PostgreSQL database. I added indexes on ID (unique index), starttime and endtime.
Searching on stackoverflow, I found some useful SQL and modified it to what I think logically should work. The problem is that the query runs for many hours and never returns:
SELECT T1.sid, count(*) as CountSimultaneous
FROM calls_nov T1, calls_nov T2
WHERE
T1.StartTime between T2.StartTime and T2.EndTime
and T1.StartTime between '2011-11-02' and '2011-11-03'
GROUP BY
T1.sid
ORDER BY CountSimultaneous DESC;
Can someone please either suggest a way to fix the query/index so that it actually works or suggest another way to calculate concurrent calls?
EDIT:
Explain plan:
Sort  (cost=11796758237.81..11796758679.47 rows=176663 width=35)
  Sort Key: (count(*))
  ->  GroupAggregate  (cost=0.00..11796738007.56 rows=176663 width=35)
        ->  Nested Loop  (cost=0.00..11511290152.45 rows=57089217697 width=35)
Table creation script:
CREATE TABLE calls_nov (
sid varchar,
starttime timestamp,
endtime timestamp,
call_to varchar,
call_from varchar,
status varchar);
Index creation:
CREATE UNIQUE INDEX sid_unique_index on calls_nov (sid);
CREATE INDEX starttime_index on calls_nov (starttime);
CREATE INDEX endtime_index on calls_nov (endtime);
1.) Your query did not catch all overlaps - this was fixed by the other answers, already. (A small self-check of the case it misses follows below, right after the base query.)
2.) The data type of your columns starttime and endtime is timestamp. So your WHERE clause is slightly wrong, too:
BETWEEN '2011-11-02' AND '2011-11-03'
This would include '2011-11-03 00:00'. The upper border has to be excluded.
3.) Removed the mixed-case syntax without double quotes. Unquoted identifiers are cast to lower case automatically. To put it simply: best not to use mixed-case identifiers in PostgreSQL at all.
4.) Transformed the query to use explicit JOIN which is always preferable. Actually, I made it a LEFT [OUTER] JOIN, because I want to count calls that overlap with no other calls, too.
5.) Simplified the syntax a bit to arrive at this base query:
SELECT t1.sid, count(*) AS ct
FROM calls_nov t1
LEFT JOIN calls_nov t2 ON t1.starttime <= t2.endtime
AND t1.endtime >= t2.starttime
WHERE t1.starttime >= '2011-11-02 0:0'::timestamp
AND t1.starttime < '2011-11-03 0:0'::timestamp
GROUP BY 1
ORDER BY 2 DESC;
This query is extremely slow for a big table, because every row starting on '2011-11-02' has to be compared to every row in the whole table, which leads to (almost) O(n²) cost (your explain plan estimates the nested loop at roughly 57 billion rows).
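Side note on point 1 (my addition, not part of the original answer; the two sample calls are invented): you can see what the original BETWEEN predicate misses by comparing it with the overlap condition on two calls where the second one starts later but still overlaps:
WITH t (sid, starttime, endtime) AS (
   VALUES ('a', timestamp '2011-11-02 10:00', timestamp '2011-11-02 10:10')
        , ('b', timestamp '2011-11-02 10:05', timestamp '2011-11-02 10:15')
)
SELECT t1.sid, t2.sid AS other_sid
     , (t1.starttime BETWEEN t2.starttime AND t2.endtime)          AS between_predicate
     , (t1.starttime <= t2.endtime AND t1.endtime >= t2.starttime) AS overlap_predicate
FROM   t t1
JOIN   t t2 ON t1.sid <> t2.sid;
The row for call 'a' shows an overlap that the BETWEEN test reports as false, while the explicit overlap condition catches it in both directions.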
Faster
We can drastically cut down the cost by pre-selecting possible candidates. Only select the columns and rows you need. I do this with two CTEs.
- Select calls starting on the day in question. -> CTE x
- Calculate the latest end of those calls (subquery in CTE y).
- Select only calls that overlap with the total range of CTE x. -> CTE y
- The final query is much faster than querying the huge underlying table.
WITH x AS (
SELECT sid, starttime, endtime
FROM calls_nov
WHERE starttime >= '2011-11-02 0:0'
AND starttime < '2011-11-03 0:0'
), y AS (
SELECT starttime, endtime
FROM calls_nov
WHERE endtime >= '2011-11-02 0:0'
AND starttime <= (SELECT max(endtime) As max_endtime FROM x)
)
SELECT x.sid, count(*) AS count_overlaps
FROM x
LEFT JOIN y ON x.starttime <= y.endtime
AND x.endtime >= y.starttime
GROUP BY 1
ORDER BY 2 DESC;
Faster yet
I have a real-life table of 350,000 rows with overlapping start/end timestamps similar to yours. I used that for a quick benchmark. PostgreSQL 8.4, scarce resources because it is a test DB. Indexes on start and end. (An index on the ID column is irrelevant here.) Tested with EXPLAIN ANALYZE, best of 5.
Base query:
Total runtime: 476994.774 ms
CTE variant:
Total runtime: 4199.788 ms -- that's > factor 100.
After adding a multicolumn index of the form:
CREATE INDEX start_end_index on calls_nov (starttime, endtime);
Total runtime: 4159.367 ms
Ultimate Speed
If that is not enough, there is a way to speed it up yet another order of magnitude. Instead of the CTEs above, materialize the temp tables and - this is the crucial point - create an index on the second one. Could look like this:
Execute as one transaction:
CREATE TEMP TABLE x ON COMMIT DROP AS
SELECT sid, starttime, endtime
FROM calls_nov
WHERE starttime >= '2011-11-02 0:0'
AND starttime < '2011-11-03 0:0';
CREATE TEMP TABLE y ON COMMIT DROP AS
SELECT starttime, endtime
FROM calls_nov
WHERE endtime >= '2011-11-02 0:0'
AND starttime <= (SELECT max(endtime) FROM x);
CREATE INDEX y_idx ON y (starttime, endtime); -- this is where the magic happens
SELECT x.sid, count(*) AS ct
FROM x
LEFT JOIN y ON x.starttime <= y.endtime
AND x.endtime >= y.starttime
GROUP BY 1
ORDER BY 2 DESC;
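One thing I would add here (my addition, not part of the original recipe): autovacuum cannot process temporary tables, so they have no statistics unless you analyze them yourself. Running ANALYZE right after filling them, before the final SELECT, can help the planner:
ANALYZE x;
ANALYZE y;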
Read about temporary tables in the manual.
Ultimate solution
- Create a plpgsql function that encapsulates the magic.
- Diagnose the typical size of your temp tables. Create them standalone and measure:
SELECT pg_size_pretty(pg_total_relation_size('tmp_tbl'));
- If they are bigger than your setting for temp_buffers, temporarily set it high enough in your function to hold both temporary tables in RAM. It is a major speedup if you don't have to swap to disc. (The setting has to be applied before the first use of temp tables in the session to take effect; see the snippet below.)
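To check the session's current setting before picking a value (my addition; SHOW and SET are standard PostgreSQL commands, and 256MB is just an example figure):
SHOW temp_buffers;           -- current setting, default is typically 8MB
SET temp_buffers = '256MB';  -- must run before the first use of temp tables in the session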
CREATE OR REPLACE FUNCTION f_call_overlaps(date)
RETURNS TABLE (sid varchar, ct integer) AS
$BODY$
DECLARE
_from timestamp := $1::timestamp;
_to timestamp := ($1 +1)::timestamp;
BEGIN
SET temp_buffers = '64MB'; -- example value; more RAM for temp tables
CREATE TEMP TABLE x ON COMMIT DROP AS
SELECT c.sid, starttime, endtime -- avoid naming conflict with OUT param
FROM calls_nov c
WHERE starttime >= _from
AND starttime < _to;
CREATE TEMP TABLE y ON COMMIT DROP AS
SELECT starttime, endtime
FROM calls_nov
WHERE endtime >= _from
AND starttime <= (SELECT max(endtime) FROM x);
CREATE INDEX y_idx ON y (starttime, endtime);
RETURN QUERY
SELECT x.sid, count(*)::int -- AS ct
FROM x
LEFT JOIN y ON x.starttime <= y.endtime AND x.endtime >= y.starttime
GROUP BY 1
ORDER BY 2 DESC;
END;
$BODY$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_call_overlaps('2011-11-02') -- just name your date
Total runtime: 138.169 ms -- that's factor 3000
What else can you do to speed it up?
General performance optimization.
CLUSTER calls_nov USING starttime_index; -- this also vacuums the table fully
ANALYZE calls_nov;