计算SQL中的并发事件数 [英] Calculate number of concurrent events in SQL

查看:67
本文介绍了计算SQL中的并发事件数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个可容纳电话的表格,其中包含以下字段:




  • ID

  • STARTTIME

  • ENDTIME

  • 状态

  • CALL_FROM

  • CALL_TO



有290万条记录加载到本地PostgreSQL数据库中。我在ID(唯一索引),开始时间和结束时间上添加了索引。问题是查询运行了多个小时,并且永远不会返回:

  SELECT T1.sid,count(*)为CountSimultaneous 
从calls_nov T1,calls_nov T2
WHERE
T1.StartTime在T2.StartTime和T2.EndTime之间
和T1.StartTime在'2011-11-02'和'2011-11之间-03'
GROUP BY
T1.sid
订单由Count同时计数;

有人可以建议一种解决查询/索引的方法以使其实际工作,还是建议另一种



编辑:



解释计划:

 排序(cost = 11796758237.81..11796758679.47行= 176663宽度= 35)
排序键:(count(*))
-> GroupAggregate(费用= 0.00..11796738007.56行= 176663宽度= 35)
->嵌套循环(成本= 0.00..11511290152.45行= 57089217697宽度= 35)

表创建脚本:

 创建表call_nov(
sid varchar,
开始时间时间戳,
结束时间时间戳,
call_to varchar,
call_from varchar,
status varchar);

创建索引:

 在calls_nov(sid)上创建唯一索引sid_unique_index; 

calls_nov上的CREATE INDEX starttime_index(开始时间);

通话创建索引endtime_index(结束时间);


解决方案

1。)您的查询未包含所有重叠项-



2。)列的数据类型 starttime 结束时间时间戳。因此,您的 WHERE 子句也略有错误:

  BETWEEN'2011 -11-02和 2011-11-03 

其中包括 2011-11- 03 00:00'。



3。)删除了不带双引号的混合大小写语法。未加引号的标识符会自动转换为小写。简单地说:最好不要在PostgreSQL中完全使用大小写混合的标识符。



4。)将查询转换为使用显式JOIN,这总是更可取的。实际上,我使它成为LEFT [OUTER] JOIN,因为我也想计算没有其他调用重叠的调用。



5。)简化了语法可以得出以下基本查询:

 选择t1.sid,count(*)AS ct 
FROMcalls_nov t1
左联接calls_nov t2打开t1.starttime< = t2.endtime
AND t1.endtime> = t2.starttime
W1位置t1.starttime> ='2011-11-02 0: 0':: timestamp
AND t1.starttime< ‘2011-11-03 0:0’:: timestamp
GROUP BY 1
ORDER BY 2 DESC;

对于一个大表,此查询非常慢,因为每一行都以'2011-11-02'上的数据必须与整个表格中的每一行进行比较,这会导致(几乎)O(n²)成本。






更快



我们可以通过预选可能的候选人来大幅度降低成本。仅选择所需的列和行。我用两个CTE来做。


  1. 选择在相关日期开始的通话。 -> CTE x

  2. 计算这些通话的最新结束时间。 (CTE y 中的子查询)

  3. 仅选择与CTE x的总范围重叠的呼叫。 -> CTE y

  4. 与查询巨大的基础表相比,最终查询快得多



  x AS(
SELECT sid,开始时间,结束时间
从calls_nov
WHERE开始时间> ='2011-11-02 0:0'
AND开始时间<'2011-11 -03 0:0'
),y AS(
选择开始时间,结束时间
FROMcalls_nov
WHERE endtime> ='2011-11-02 0:0'
AND starttime< =(SELECT max(endtime)as max_endtime FROM x)

SELECT x.sid,count(*)AS count_overlaps
FROM x
LEFT JOIN y ON x.starttime< = y.endtime
和x.endtime> = y.starttime
GROUP BY 1
ORDER BY 2 DESC;






更快



我有一个350.000行的现实生活表,其开始/结束时间戳记与您的相似。我将其用作快速基准。 PostgreSQL 8.4,因为它是一个测试数据库,所以资源稀缺。 开始结束上的索引。 (此处的ID列索引不相关。)使用 EXPLAIN ANALYZE (最好是5分)进行测试。



总运行时间: 476994.774 ms



CTE变体:

总运行时间:4199.788 ms-这是100倍。



添加多列索引的形式:

 在call_nov上创建索引start_end_index(开始时间,结束时间); 

总运行时间:4159.367 ms






终极速度



如果这还不够的话,有一种方法可以将其提高另一个数量级。而不是上面的CTE,具体化临时表,并且-这是关键点-在第二个表上创建一个 index 。可能看起来像这样:



作为一项交易执行:

 在提交请求时创建温度表x 
在calls_nov
中从sid,开始时间,结束时间
中启动时间> ='2011-11-02 0:0'
AND开始时间< ‘2011-11-03 0:0’;

创建温度表y作为提交
选择开始时间,结束时间
从calls_nov
处开始调用,结束时间> ='2011-11-02 0:0'
AND开始时间< =(从x中选择最大(结束时间));

CREATE INDEX y_idx ON y(开始时间,结束时间); -这就是魔术发生的地方

选择x.sid,count(*)as ct
FROM x
左联接y ON x.starttime< = y.endtime
AND x.endtime> = y.starttime
GROUP BY 1
ORDER BY 2 DESC;

阅读有关手册中的临时表






终极解决方案




  • 创建一个封装了魔术的 plpgsql函数


  • 诊断临时表的典型大小。独立创建它们并进行测量:

      SELECT pg_size_pretty(pg_total_relation_size(’tmp_tbl’)); 


  • 如果它们大于您为 temp_buffers ,然后在函数中暂时将它们设置得足够高,以将两个临时表都保存在RAM中。如果您不必换成光盘,则可以大大提高速度。 (必须首先在会话中使用临时表才能生效。)






 创建或替换功能f_call_overlaps(date)
返回表(sid varchar,ct整数)AS
$ BODY $
DECLARE
_from timestamp:= $ 1 :: timestamp;
_to时间戳:=($ 1 +1):: timestamp;
BEGIN

SET temp_buffers = 64MB’; -示例值;临时表有更多RAM;

创建温度表x在提交时作为
选择c.sid,开始时间,结束时间-避免与OUT参数命名冲突
从calls_nov c
开始时间> ; = _from
AND开始时间< _至;

创建温度表y作为提交
选择开始时间,结束时间
从calls_nov
到哪里,结束时间> = _从
并且开始时间< =( SELECT max(endtime)FROM x);

CREATE INDEX y_idx ON y(开始时间,结束时间);

返回查询
选择x.sid,count(*):: int-as ct
FROM x
左联接y ON x.starttime< = y.endtime AND x.endtime> = y.starttime
GROUP BY 1
ORDER BY 2 DESC;

END;
$ BODY $语言plpgsql;

致电:

  SELECT * FROM f_call_overlaps('2011-11-02')-仅命名您的日期

总运行时间:138.169毫秒-这是3000的因数






您还能做什么?



常规性能优化

  CLUSTERcalls_nov使用starttime_index; -这样也可以完全清理桌子

ANALYZE Calls_nov;


I have a table that holds phone calls, with the following fields:

  • ID
  • STARTTIME
  • ENDTIME
  • STATUS
  • CALL_FROM
  • CALL_TO

There are 2,9 million records loaded into a local PostgreSQL database. I added indexes on ID (unique index), starttime and endtime.

Searching on stackoverflow, I found some useful SQL and modified it to what I think logically should work. The problem is that the query runs for many hours and never returns:

SELECT T1.sid, count(*) as CountSimultaneous
FROM calls_nov T1, calls_nov T2
WHERE
     T1.StartTime between T2.StartTime and T2.EndTime
     and T1.StartTime between '2011-11-02' and '2011-11-03'
GROUP BY
     T1.sid
ORDER BY CountSimultaneous DESC;

Can someone please either suggest a way to fix the query/index so that it actually works or suggest another way to calculate concurrent calls?

EDIT:

Explain plan:

Sort  (cost=11796758237.81..11796758679.47 rows=176663 width=35)
  Sort Key: (count(*))
  ->  GroupAggregate  (cost=0.00..11796738007.56 rows=176663 width=35)
        ->  Nested Loop  (cost=0.00..11511290152.45 rows=57089217697 width=35)

Table creation script:

CREATE TABLE calls_nov (
  sid varchar,
  starttime timestamp, 
  endtime timestamp, 
  call_to varchar, 
  call_from varchar, 
  status varchar);

Index creation:

CREATE UNIQUE INDEX sid_unique_index on calls_nov (sid);

CREATE INDEX starttime_index on calls_nov (starttime);

CREATE INDEX endtime_index on calls_nov (endtime);

解决方案

1.) Your query did not catch all overlaps - this was fixed by the other answers, already.

2.) The data type of your columns starttime and endtime is timestamp. So your WHERE clause is slightly wrong, too:

BETWEEN '2011-11-02' AND '2011-11-03'

This would include '2011-11-03 00:00'. The upper border has to be excluded.

3.) Removed the mixed case syntax without double-quotes. Unquoted identifiers are cast to lower case automatically. To put it simple: Best don't use mixed case identifiers at all in PostgreSQL.

4.) Transformed the query to use explicit JOIN which is always preferable. Actually, I made it a LEFT [OUTER] JOIN, because I want to count calls that overlap with no other calls, too.

5.) Simplified the syntax a bit to arrive at this base query:

SELECT t1.sid, count(*) AS ct
FROM   calls_nov t1
LEFT   JOIN calls_nov t2 ON t1.starttime <= t2.endtime
                        AND t1.endtime >= t2.starttime
WHERE  t1.starttime >= '2011-11-02 0:0'::timestamp
AND    t1.starttime <  '2011-11-03 0:0'::timestamp
GROUP  BY 1
ORDER  BY 2 DESC;

This query is extremely slow for a big table, because every row starting on '2011-11-02' has to be compared to every row in the whole table, which leads to (almost) O(n²) cost.


Faster

We can drastically cut down the cost by pre-selecting possible candidates. Only select columns and rows you need. I do this with two CTE.

  1. Select calls starting on the day in question. -> CTE x
  2. Calculate the latest end of those calls. (subquery in CTE y)
  3. Select only calls that overlap with the total range of CTE x. -> CTE y
  4. The final query is much faster than querying the huge underlying table.

WITH x AS (
    SELECT sid, starttime, endtime
    FROM   calls_nov
    WHERE  starttime >= '2011-11-02 0:0'
    AND    starttime <  '2011-11-03 0:0'
    ), y AS (
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= '2011-11-02 0:0'
    AND    starttime <= (SELECT max(endtime) As max_endtime FROM x)
    )
SELECT x.sid, count(*) AS count_overlaps
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;


Faster yet

I have a real life table of 350.000 rows with overlapping start / end timestamps similar to yours. I used that for a quick benchmark. PostgreSQL 8.4, scarce resources because it is a test DB. Indexes on start and end. (Index on ID column is irrelevant here.) Tested with EXPLAIN ANALYZE, best of 5.

Total runtime: 476994.774 ms

CTE variant:
Total runtime: 4199.788 ms -- that's > factor 100.

After adding a multicolumn index of the form:

CREATE INDEX start_end_index on calls_nov (starttime, endtime);

Total runtime: 4159.367 ms


Ultimate Speed

If that is not enough, there is a way to speed it up yet another order of magnitude. Instead of the CTEs above, materialize the temp tables and - this is the crucial point - create an index on the second one. Could look like this:

Execute as one transaction:

CREATE TEMP TABLE x ON COMMIT DROP AS   
    SELECT sid, starttime, endtime
    FROM   calls_nov
    WHERE  starttime >= '2011-11-02 0:0'
    AND    starttime <  '2011-11-03 0:0';

CREATE TEMP TABLE y ON COMMIT DROP AS
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= '2011-11-02 0:0'
    AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime); -- this is where the magic happens

SELECT x.sid, count(*) AS ct
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

Read about temporary tables in the manual.


Ultimate solution

  • Create a plpgsql function that encapsulates the magic.

  • Diagnose the typical size of your temp tables. Create them standalone and measure:

    SELECT pg_size_pretty(pg_total_relation_size('tmp_tbl'));
    

  • If they are bigger than your setting for temp_buffers then temporarily set them high enough in your function to hold both your temporary tables in RAM. It is a major speedup if you don't have to swap to disc. (Must be first use of temp tables in session to have effect.)

CREATE OR REPLACE FUNCTION f_call_overlaps(date)
  RETURNS TABLE (sid varchar, ct integer) AS
$BODY$
DECLARE
    _from timestamp := $1::timestamp;
    _to   timestamp := ($1 +1)::timestamp;
BEGIN

SET temp_buffers = 64MB'; -- example value; more RAM for temp tables;

CREATE TEMP TABLE x ON COMMIT DROP AS   
    SELECT c.sid, starttime, endtime  -- avoid naming conflict with OUT param
    FROM   calls_nov c
    WHERE  starttime >= _from
    AND    starttime <  _to;

CREATE TEMP TABLE y ON COMMIT DROP AS
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= _from
    AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime);

RETURN QUERY
SELECT x.sid, count(*)::int -- AS ct
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

END;
$BODY$   LANGUAGE plpgsql;

Call:

SELECT * FROM f_call_overlaps('2011-11-02') -- just name your date

Total runtime: 138.169 ms -- that's factor 3000


What else can you do to speed it up?

General performance optimization.

CLUSTER calls_nov USING starttime_index; -- this also vacuums the table fully

ANALYZE calls_nov;

这篇关于计算SQL中的并发事件数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆