计算 SQL 中的并发事件数 [英] Calculate number of concurrent events in SQL

查看：12 发布时间：2022/1/13 8:54:54 sql postgresql performance timestamp query-optimization

本文介绍了计算 SQL 中的并发事件数的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个保存电话的表格，其中包含以下字段:

身份证
开始时间
结束时间
状态
CALL_FROM
CALL_TO

有 290 万条记录加载到本地 PostgreSQL 数据库中.我在 ID(唯一索引)、开始时间和结束时间上添加了索引.

在 stackoverflow 上搜索，我发现了一些有用的 SQL，并将其修改为我认为在逻辑上应该可以工作的内容.问题是查询运行了好几个小时并且永远不会返回:

SELECT T1.sid, count(*) as CountSimultaneous从calls_nov T1，calls_nov T2在哪里T2.StartTime 和 T2.EndTime 之间的 T1.StartTime和 T1.StartTime 在 '2011-11-02' 和 '2011-11-03' 之间通过...分组T1.sidORDER BY CountSimultaneous DESC；

是否有人可以建议一种方法来修复查询/索引以使其实际工作或建议另一种方法来计算并发调用?

解释计划:

排序(成本=11796758237.81..11796758679.47 行=176663 宽度=35)排序键:(count(*))->GroupAggregate(成本=0.00..11796738007.56 行=176663 宽度=35)->嵌套循环(成本=0.00..11511290152.45 行=57089217697 宽度=35)

表创建脚本:

CREATE TABLE 调用_nov (sid varchar,开始时间时间戳，结束时间时间戳，call_to varchar,call_from varchar,状态变量)；

索引创建:

在 call_nov (sid) 上创建唯一索引 sid_unique_index；CREATE INDEX starttime_index on calls_nov (starttime);在calls_nov(结束时间)上创建索引endtime_index；

解决方案

1.) 您的查询没有捕获所有重叠 - 这已经被其他答案修复了.

2.) starttime 和 endtime 列的数据类型是 timestamp.所以你的 WHERE 子句也有点错误:

在2011-11-02"和2011-11-03"之间

这将包括2011-11-03 00:00".上边框必须排除.

3.) 删除了不带双引号的混合大小写语法.不带引号的标识符会自动转换为小写.简单来说:最好不要在 PostgreSQL 中使用大小写混合的标识符.

4.) 将查询转换为使用总是更可取的显式 JOIN.实际上，我将其设为 LEFT [OUTER] JOIN，因为我也想计算与其他调用不重叠的调用.

5.) 稍微简化语法以达到此基本查询:

SELECT t1.sid, count(*) AS ctFROM calls_nov t1LEFT JOIN calls_nov t2 ON t1.starttime <= t2.endtimeAND t1.endtime >= t2.starttimeWHERE t1.starttime >= '2011-11-02 0:0'::timestampAND t1.starttime <'2011-11-03 0:0'::时间戳按 1 分组按 2 DESC 排序；

对于大表，此查询非常慢，因为从 '2011-11-02' 开始的每一行都必须与整个表中的每一行进行比较，这导致(几乎)O(n²) 成本.

更快

我们可以通过预选可能的候选人来大幅降低成本.只选择您需要的列和行.我用两个 CTE 做到这一点.

选择从相关日期开始的通话.->CTE x
计算这些通话的最晚结束时间.(CTE 中的子查询 y)
仅选择与 CTE x 的总范围重叠的调用.->CTE y
最终查询比查询庞大的基础表快得多.

WITH x AS (选择 sid、开始时间、结束时间FROM calls_novWHERE 开始时间 >= '2011-11-02 0:0'AND 开始时间 <'2011-11-03 0:0'), y AS (选择开始时间，结束时间FROM calls_novWHERE 结束时间 >= '2011-11-02 0:0'AND starttime <= (SELECT max(endtime) As max_endtime FROM x))SELECT x.sid, count(*) AS count_overlaps从 xLEFT JOIN y ON x.starttime <= y.endtimeAND x.endtime >= y.starttime按 1 分组按 2 DESC 排序；

更快

我有一个包含 350.000 行的真实表格，其中包含与您类似的重叠开始/结束时间戳.我将它用于快速基准测试.PostgreSQL 8.4，资源稀缺，因为它是一个测试数据库.start 和 end 上的索引.(ID 列的索引在这里无关紧要.)使用 EXPLAIN ANALYZE 进行测试，最好是 5.

总运行时间:476994.774 毫秒

CTE 变体:
总运行时间:4199.788 毫秒——>因子 100.

添加多列索引形式后:

CREATE INDEX start_end_index on calls_nov (starttime, endtime);

总运行时间:4159.367 毫秒

终极速度

如果这还不够，还有一种方法可以将其加快另一个数量级.而不是上面的 CTE，具体化临时表 - 这是关键点 - 在第二个表上创建一个 index.可能看起来像这样:

作为一个交易执行:

CREATE TEMP TABLE x ON COMMIT DROP AS选择 sid、开始时间、结束时间FROM calls_novWHERE 开始时间 >= '2011-11-02 0:0'AND 开始时间 <'2011-11-03 0:0';CREATE TEMP TABLE y ON COMMIT DROP AS选择开始时间，结束时间FROM calls_novWHERE 结束时间 >= '2011-11-02 0:0'AND starttime <= (SELECT max(endtime) FROM x);CREATE INDEX y_idx ON y(开始时间，结束时间)；——这就是魔法发生的地方选择 x.sid，计数(*)作为 ct从 xLEFT JOIN y ON x.starttime <= y.endtimeAND x.endtime >= y.starttime按 1 分组按 2 DESC 排序；

阅读手册中的临时表.p>

终极解决方案

创建一个封装魔法的plpgsql函数.

诊断临时表的典型大小.独立创建并测量:

 选择 pg_size_pretty(pg_total_relation_size('tmp_tbl'));

如果它们大于 temp_buffers 的设置然后在您的函数中临时将它们设置得足够高，以将两个临时表都保存在 RAM 中.如果您不必交换到光盘，这是一个重大的加速.(必须首先在会话中使用临时表才能生效.)

创建或替换函数 f_call_overlaps(date)返回表(sid varchar，ct 整数)AS$身体$宣布_from 时间戳 := $1::timestamp;_to 时间戳:= ($1 +1)::timestamp;开始SET temp_buffers = 64MB';-- 示例值；临时表的更多 RAM；CREATE TEMP TABLE x ON COMMIT DROP ASSELECT c.sid, starttime, endtime -- 避免与 OUT 参数的命名冲突FROM calls_nov cWHERE 开始时间 >= _fromAND 开始时间 <_到;CREATE TEMP TABLE y ON COMMIT DROP AS选择开始时间，结束时间FROM calls_novWHERE 结束时间 >= _fromAND starttime <= (SELECT max(endtime) FROM x);CREATE INDEX y_idx ON y(开始时间，结束时间)；返回查询选择 x.sid, count(*)::int -- AS ct从 xLEFT JOIN y ON x.starttime <= y.endtime AND x.endtime >= y.starttime按 1 分组按 2 DESC 排序；结尾;$BODY$ 语言 plpgsql;

呼叫:

SELECT * FROM f_call_overlaps('2011-11-02') -- 命名你的日期

总运行时间:138.169 毫秒 - 这是 3000 倍

你还能做些什么来加快速度?

一般性能优化.

CLUSTER calls_nov USING starttime_index;- 这也完全吸尘表分析calls_nov；

I have a table that holds phone calls, with the following fields:

ID
STARTTIME
ENDTIME
STATUS
CALL_FROM
CALL_TO

There are 2,9 million records loaded into a local PostgreSQL database. I added indexes on ID (unique index), starttime and endtime.

Searching on stackoverflow, I found some useful SQL and modified it to what I think logically should work. The problem is that the query runs for many hours and never returns:

SELECT T1.sid, count(*) as CountSimultaneous
FROM calls_nov T1, calls_nov T2
WHERE
     T1.StartTime between T2.StartTime and T2.EndTime
     and T1.StartTime between '2011-11-02' and '2011-11-03'
GROUP BY
     T1.sid
ORDER BY CountSimultaneous DESC;

Can someone please either suggest a way to fix the query/index so that it actually works or suggest another way to calculate concurrent calls?

EDIT:

Explain plan:

Sort  (cost=11796758237.81..11796758679.47 rows=176663 width=35)
  Sort Key: (count(*))
  ->  GroupAggregate  (cost=0.00..11796738007.56 rows=176663 width=35)
        ->  Nested Loop  (cost=0.00..11511290152.45 rows=57089217697 width=35)

Table creation script:

CREATE TABLE calls_nov (
  sid varchar,
  starttime timestamp, 
  endtime timestamp, 
  call_to varchar, 
  call_from varchar, 
  status varchar);

Index creation:

CREATE UNIQUE INDEX sid_unique_index on calls_nov (sid);

CREATE INDEX starttime_index on calls_nov (starttime);

CREATE INDEX endtime_index on calls_nov (endtime);

解决方案

1.) Your query did not catch all overlaps - this was fixed by the other answers, already.

2.) The data type of your columns starttime and endtime is timestamp. So your WHERE clause is slightly wrong, too:

BETWEEN '2011-11-02' AND '2011-11-03'

This would include '2011-11-03 00:00'. The upper border has to be excluded.

3.) Removed the mixed case syntax without double-quotes. Unquoted identifiers are cast to lower case automatically. To put it simple: Best don't use mixed case identifiers at all in PostgreSQL.

4.) Transformed the query to use explicit JOIN which is always preferable. Actually, I made it a LEFT [OUTER] JOIN, because I want to count calls that overlap with no other calls, too.

5.) Simplified the syntax a bit to arrive at this base query:

SELECT t1.sid, count(*) AS ct
FROM   calls_nov t1
LEFT   JOIN calls_nov t2 ON t1.starttime <= t2.endtime
                        AND t1.endtime >= t2.starttime
WHERE  t1.starttime >= '2011-11-02 0:0'::timestamp
AND    t1.starttime <  '2011-11-03 0:0'::timestamp
GROUP  BY 1
ORDER  BY 2 DESC;

This query is extremely slow for a big table, because every row starting on '2011-11-02' has to be compared to every row in the whole table, which leads to (almost) O(n²) cost.

Faster

We can drastically cut down the cost by pre-selecting possible candidates. Only select columns and rows you need. I do this with two CTE.

Select calls starting on the day in question. -> CTE x
Calculate the latest end of those calls. (subquery in CTE y)
Select only calls that overlap with the total range of CTE x. -> CTE y
The final query is much faster than querying the huge underlying table.

WITH x AS (
    SELECT sid, starttime, endtime
    FROM   calls_nov
    WHERE  starttime >= '2011-11-02 0:0'
    AND    starttime <  '2011-11-03 0:0'
    ), y AS (
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= '2011-11-02 0:0'
    AND    starttime <= (SELECT max(endtime) As max_endtime FROM x)
    )
SELECT x.sid, count(*) AS count_overlaps
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

Faster yet

I have a real life table of 350.000 rows with overlapping start / end timestamps similar to yours. I used that for a quick benchmark. PostgreSQL 8.4, scarce resources because it is a test DB. Indexes on start and end. (Index on ID column is irrelevant here.) Tested with EXPLAIN ANALYZE, best of 5.

Total runtime: 476994.774 ms

CTE variant:
Total runtime: 4199.788 ms -- that's > factor 100.

After adding a multicolumn index of the form:

CREATE INDEX start_end_index on calls_nov (starttime, endtime);

Total runtime: 4159.367 ms

Ultimate Speed

If that is not enough, there is a way to speed it up yet another order of magnitude. Instead of the CTEs above, materialize the temp tables and - this is the crucial point - create an index on the second one. Could look like this:

Execute as one transaction:

CREATE TEMP TABLE x ON COMMIT DROP AS   
    SELECT sid, starttime, endtime
    FROM   calls_nov
    WHERE  starttime >= '2011-11-02 0:0'
    AND    starttime <  '2011-11-03 0:0';

CREATE TEMP TABLE y ON COMMIT DROP AS
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= '2011-11-02 0:0'
    AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime); -- this is where the magic happens

SELECT x.sid, count(*) AS ct
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime
             AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

Read about temporary tables in the manual.

Ultimate solution

Create a plpgsql function that encapsulates the magic.
Diagnose the typical size of your temp tables. Create them standalone and measure:
```
  SELECT pg_size_pretty(pg_total_relation_size('tmp_tbl'));
```
If they are bigger than your setting for temp_buffers then temporarily set them high enough in your function to hold both your temporary tables in RAM. It is a major speedup if you don't have to swap to disc. (Must be first use of temp tables in session to have effect.)

CREATE OR REPLACE FUNCTION f_call_overlaps(date)
  RETURNS TABLE (sid varchar, ct integer) AS
$BODY$
DECLARE
    _from timestamp := $1::timestamp;
    _to   timestamp := ($1 +1)::timestamp;
BEGIN

SET temp_buffers = 64MB'; -- example value; more RAM for temp tables;

CREATE TEMP TABLE x ON COMMIT DROP AS   
    SELECT c.sid, starttime, endtime  -- avoid naming conflict with OUT param
    FROM   calls_nov c
    WHERE  starttime >= _from
    AND    starttime <  _to;

CREATE TEMP TABLE y ON COMMIT DROP AS
    SELECT starttime, endtime
    FROM   calls_nov
    WHERE  endtime >= _from
    AND    starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime);

RETURN QUERY
SELECT x.sid, count(*)::int -- AS ct
FROM   x
LEFT   JOIN y ON x.starttime <= y.endtime AND x.endtime >= y.starttime
GROUP  BY 1
ORDER  BY 2 DESC;

END;
$BODY$   LANGUAGE plpgsql;

Call:

SELECT * FROM f_call_overlaps('2011-11-02') -- just name your date

Total runtime: 138.169 ms -- that's factor 3000

What else can you do to speed it up?

General performance optimization.

CLUSTER calls_nov USING starttime_index; -- this also vacuums the table fully

ANALYZE calls_nov;

这篇关于计算 SQL 中的并发事件数的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

计算 SQL 中的并发事件数 [英] Calculate number of concurrent events in SQL

问题描述

更快

更快

终极速度

终极解决方案

你还能做些什么来加快速度?

Faster

Faster yet

Ultimate Speed

Ultimate solution

What else can you do to speed it up?

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

计算 SQL 中的并发事件数 [英] Calculate number of concurrent events in SQL

问题描述

更快

更快

终极速度

终极解决方案

你还能做些什么来加快速度?

Faster

Faster yet

Ultimate Speed

Ultimate solution

What else can you do to speed it up?

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭