SQL:优化BETWEEN子句 [英] Sql: Optimizing BETWEEN clause

查看:65
本文介绍了SQL:优化BETWEEN子句的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我写了一条声明,需要大约一个小时才能运行,所以我在寻求帮助,这样我才能更快地做到这一点.所以我们开始:

I wrote a statement that takes almost an hour to run so I am asking help so I can get to do this faster. So here we go:

我正在对两个表进行内部连接:

I am making an inner join of two tables :

我有许多用时间间隔表示的时间间隔,我只想在这些时间间隔内从量度中获取量度数据.

I have many time intervals represented by intervals and i want to get measure datas from measures only within those intervals.

intervals:有两列,一列是间隔的开始时间,另一列是间隔的结束时间(行数= 1295)

intervals: has two columns, one is the starting time, the other the ending time of the interval (number of rows = 1295)

measures:有两列,一列带有度量,另一列带有制定度量的时间(行数=一百万)

measures: has two columns, one with the measure, the other with the time the measure has been made (number of rows = one million)

我想要得到的结果是一个表格,该表格的第一列中有度量,然后是度量完成的时间,所考虑的间隔的开始/结束时间(对于行,重复的时间在考虑范围)

The result I want to get is a table with in the first column the measure, then the time the measure has been done, the begin/end time of the considered interval (it would be repeated for row with a time within the considered range)

这是我的代码:

select measures.measure as measure, measures.time as time, intervals.entry_time as entry_time, intervals.exit_time as exit_time
    from
    intervals
    inner join  
    measures
    on  intervals.entry_time<=measures.time  and measures.time <=intervals.exit_time  
    order by time asc

谢谢

推荐答案

这是一个非常普遍的问题.

This is quite a common problem.

普通B-Tree索引不适用于这样的查询:

Plain B-Tree indexes are not good for the queries like this:

SELECT  measures.measure as measure,
        measures.time as time,
        intervals.entry_time as entry_time,
        intervals.exit_time as exit_time
FROM    intervals
JOIN    measures
ON      measures.time BETWEEN intervals.entry_time AND intervals.exit_time
ORDER BY
        time ASC

索引很适合在给定范围内搜索值,例如:

An index is good for searching the values within the given bounds, like this:

,但不适用于搜索包含给定值的边界,例如:

, but not for searching the bounds containing the given value, like this:

我博客中的这篇文章更详细地解释了该问题:

This article in my blog explains the problem in more detail:

(嵌套集模型处理类似的谓词类型).

(the nested sets model deals with the similar type of predicate).

您可以在time上建立索引,这样intervals将在连接中处于领先地位,范围时间将在嵌套循环内使用.这将需要在time上进行排序.

You can make the index on time, this way the intervals will be leading in the join, the ranged time will be used inside the nested loops. This will require sorting on time.

您可以在intervals(使用MyISAM存储在MySQL中可用)上创建空间索引,该索引将在一个几何列中包含startend.这样,measures可以导致联接,并且不需要排序.

You can create a spatial index on intervals (available in MySQL using MyISAM storage) that would include start and end in one geometry column. This way, measures can lead in the join and no sorting will be needed.

但是,空间索引的速度较慢,因此,只有在测量值少而间隔长的情况下,这才是有效的.

The spatial indexes, however, are more slow, so this will only be efficient if you have few measures but many intervals.

由于间隔时间短,但度量值很多,因此只需确保在measures.time上有一个索引即可:

Since you have few intervals but many measures, just make sure you have an index on measures.time:

CREATE INDEX ix_measures_time ON measures (time)

更新:

这是一个示例脚本进行测试:

Here's a sample script to test:

BEGIN
        DBMS_RANDOM.seed(20091223);
END;
/

CREATE TABLE intervals (
        entry_time NOT NULL,
        exit_time NOT NULL
)
AS
SELECT  TO_DATE('23.12.2009', 'dd.mm.yyyy') - level,
        TO_DATE('23.12.2009', 'dd.mm.yyyy') - level + DBMS_RANDOM.value
FROM    dual
CONNECT BY
        level <= 1500
/

CREATE UNIQUE INDEX ux_intervals_entry ON intervals (entry_time)
/

CREATE TABLE measures (
        time NOT NULL,
        measure NOT NULL
)
AS
SELECT  TO_DATE('23.12.2009', 'dd.mm.yyyy') - level / 720,
        CAST(DBMS_RANDOM.value * 10000 AS NUMBER(18, 2))
FROM    dual
CONNECT BY
        level <= 1080000
/

ALTER TABLE measures ADD CONSTRAINT pk_measures_time PRIMARY KEY (time)
/

CREATE INDEX ix_measures_time_measure ON measures (time, measure)
/

此查询:

SELECT  SUM(measure), AVG(time - TO_DATE('23.12.2009', 'dd.mm.yyyy'))
FROM    (
        SELECT  *
        FROM    (
                SELECT  /*+ ORDERED USE_NL(intervals measures) */
                        *
                FROM    intervals
                JOIN    measures
                ON      measures.time BETWEEN intervals.entry_time AND intervals.exit_time
                ORDER BY
                        time
                )
        WHERE   rownum <= 500000
        )

使用NESTED LOOPS并在1.7秒内返回.

此查询:

SELECT  SUM(measure), AVG(time - TO_DATE('23.12.2009', 'dd.mm.yyyy'))
FROM    (
        SELECT  *
        FROM    (
                SELECT  /*+ ORDERED USE_MERGE(intervals measures) */
                        *
                FROM    intervals
                JOIN    measures
                ON      measures.time BETWEEN intervals.entry_time AND intervals.exit_time
                ORDER BY
                        time
                )
        WHERE   rownum <= 500000
        )

使用MERGE JOIN,我不得不在5分钟后将其停止.

uses MERGE JOIN and I had to stop it after 5 minutes.

更新2:

您很可能需要使用类似这样的提示来强制引擎在联接中使用正确的表顺序:

You will most probably need to force the engine to use the correct table order in the join using a hint like this:

SELECT  /*+ LEADING (intervals) USE_NL(intervals, measures) */
        measures.measure as measure,
        measures.time as time,
        intervals.entry_time as entry_time,
        intervals.exit_time as exit_time
FROM    intervals
JOIN    measures
ON      measures.time BETWEEN intervals.entry_time AND intervals.exit_time
ORDER BY
        time ASC

Oracle的优化器不够聪明,无法看到间隔不相交.这就是为什么它很可能会使用measures作为前导表的原因(如果间隔相交,这将是一个明智的决定).

The Oracle's optimizer is not smart enough to see that the intervals do not intersect. That's why it will most probably use measures as a leading table (which would be a wise decision should the intervals intersect).

更新3:

WITH    splits AS
        (
        SELECT  /*+ MATERIALIZE */
                entry_range, exit_range,
                exit_range - entry_range + 1 AS range_span,
                entry_time, exit_time
        FROM    (
                SELECT  TRUNC((entry_time - TO_DATE(1, 'J')) * 2) AS entry_range,
                        TRUNC((exit_time - TO_DATE(1, 'J')) * 2) AS exit_range,
                        entry_time,
                        exit_time
                FROM    intervals
                )
        ),
        upper AS
        (
        SELECT  /*+ MATERIALIZE */
                MAX(range_span) AS max_range
        FROM    splits
        ),
        ranges AS
        (
        SELECT  /*+ MATERIALIZE */
                level AS chunk
        FROM    upper
        CONNECT BY
                level <= max_range
        ),
        tiles AS
        (
        SELECT  /*+ MATERIALIZE USE_MERGE (r s) */
                entry_range + chunk - 1 AS tile,
                entry_time,
                exit_time
        FROM    ranges r
        JOIN    splits s
        ON      chunk <= range_span
        )
SELECT  /*+ LEADING(t) USE_HASH(m t) */
        SUM(LENGTH(stuffing))
FROM    tiles t
JOIN    measures m
ON      TRUNC((m.time - TO_DATE(1, 'J')) * 2) = tile
        AND m.time BETWEEN t.entry_time AND t.exit_time

此查询将时间轴划分为多个范围,并使用HASH JOIN将范围值的度量和时间戳进行联接,稍后进行精细过滤.

This query splits the time axis into the ranges and uses a HASH JOIN to join the measures and timestamps on the range values, with fine filtering later.

请参阅我的博客中的这篇文章,以获取有关其工作原理的详细说明:

See this article in my blog for more detailed explanations on how it works:

这篇关于SQL:优化BETWEEN子句的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆