优化每天查看特定时间窗口的查询 [英] Optimizing query that looks at a specific time window each day

查看:77
本文介绍了优化每天查看特定时间窗口的查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是我上一个问题的跟进

优化查询以获取整个行,其中一个字段是组的最大字段

我将从那里使用的名称更改名称,以使它们更令人难忘,但这些并不代表我的实际用例(因此请不要估计它们的记录数).

我有一个具有这样的架构的表:

OrderTime           DATETIME(6),
Customer            VARCHAR(50),
DrinkPrice          DECIMAL,
Bartender           VARCHAR(50),
TimeToPrepareDrink  TIME(6),
...

我想从表格中提取代表每天每个客户在欢乐时光(下午3点至下午6点)最昂贵的饮料订单的行.因此,例如,我想要类似的结果

Date   | Customer | OrderTime   | MaxPrice   | Bartender | ...
-------+----------+-------------+------------+-----------+-----
1/1/18 |  Alice   | 1/1/18 3:45 | 13.15      | Jane      | ...
1/1/18 |  Bob     | 1/1/18 5:12 |  9.08      | Jane      | ...
1/1/18 |  Carol   | 1/1/18 4:45 | 20.00      | Tarzan    | ...
1/2/18 |  Alice   | 1/2/18 3:45 | 13.15      | Jane      | ...
1/2/18 |  Bob     | 1/2/18 5:57 |  6.00      | Tarzan    | ...
1/2/18 |  Carol   | 1/2/18 3:13 |  6.00      | Tarzan    | ...
 ...

该表在OrderTime上具有索引,并包含数百亿条记录. (我的顾客是重度饮酒者).

感谢上一个问题,我可以轻松提取特定日期的内容.我可以做类似的事情:

SELECT * FROM orders b
INNER JOIN (
    SELECT Customer, MAX(DrinkPrice) as MaxPrice
    FROM orders
    WHERE OrderTime >= '2018-01-01 15:00' 
      AND OrderTime <= '2018-01-01 18:00'
    GROUP BY Customer
) AS a
ON a.Customer = b.Customer
AND a.MaxPrice = b.DrinkPrice
WHERE b.OrderTime >= '2018-01-01 15:00'
  AND b.OrderTime <= '2018-01-01 18:00';

此查询在不到一秒钟的时间内运行.解释计划如下所示:

+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| id| select_type | table      | type  | possible_keys | key        | ref                | Extra                                                  |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| 1 | PRIMARY     | b          | range | OrderTime     | OrderTime  | NULL               | Using index condition                                  |
| 1 | PRIMARY     | <derived2> | ref   | key0          | key0       | b.Customer,b.Price |                                                        |
| 2 | DERIVED     | orders     | range | OrderTime     | OrderTime  | NULL               | Using index condition; Using temporary; Using filesort |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+

我还可以获取有关查询相关行的信息:

SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
FROM
        orders
    INNER JOIN
        (SELECT '2018-01-01' AS Date 
         UNION
         SELECT '2018-01-02' AS Date) dates
WHERE   OrderTime >= TIMESTAMP(Date, '15:00:00')
AND OrderTime <= TIMESTAMP(Date, '18:00:00')
GROUP BY Date, Customer
 HAVING MaxPrice > 0;

此查询还可以在不到一秒钟的时间内运行.这是其解释计划的外观:

+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
| id   | select_type  | table      | type | possible_keys | key  | ref  | Extra                                          |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
|    1 | PRIMARY      | <derived2> | ALL  | NULL          | NULL | NULL | Using temporary; Using filesort                |
|    1 | PRIMARY      | orders     | ALL  | OrderTime     | NULL | NULL | Range checked for each record (index map: 0x1) |
|    2 | DERIVED      | NULL       | NULL | NULL          | NULL | NULL | No tables used                                 |
|    3 | UNION        | NULL       | NULL | NULL          | NULL | NULL | No tables used                                 |
| NULL | UNION RESULT | <union2,3> | ALL  | NULL          | NULL | NULL |                                                |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+

现在的问题是从表中检索剩余的字段.我尝试从以前改编该技巧,就像这样:

SELECT * FROM
        orders a
    INNER JOIN
        (SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
        FROM
                orders
            INNER JOIN
                (SELECT '2018-01-01' AS Date
                 UNION
                 SELECT '2018-01-02' AS Date) dates
        WHERE   OrderTime >= TIMESTAMP(Date, '15:00:00')
            AND OrderTime <= TIMESTAMP(Date, '18:00:00')
        GROUP BY Date, Customer
        HAVING MaxPrice > 0) b
    ON     a.OrderTime >= TIMESTAMP(b.Date, '15:00:00')
       AND a.OrderTime <= TIMESTAMP(b.Date, '18:00:00')
       AND a.Customer = b.Customer;

但是,由于我不了解的原因,数据库选择以永久的方式执行此操作.解释计划:

+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
| id   | select_type  | table      | type | possible_keys | key  | ref        | Extra                                          |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
|    1 | PRIMARY      | a          | ALL  | OrderTime     | NULL | NULL       |                                                |
|    1 | PRIMARY      | <derived2> | ref  | key0          | key0 | a.Customer | Using where                                    |
|    2 | DERIVED      | <derived3> | ALL  | NULL          | NULL | NULL       | Using temporary; Using filesort                |
|    2 | DERIVED      | orders     | ALL  | OrderTime     | NULL | NULL       | Range checked for each record (index map: 0x1) |
|    3 | DERIVED      | NULL       | NULL | NULL          | NULL | NULL       | No tables used                                 |
|    4 | UNION        | NULL       | NULL | NULL          | NULL | NULL       | No tables used                                 |
| NULL | UNION RESULT | <union3,4> | ALL  | NULL          | NULL | NULL       |                                                |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+

问题:

  1. 这是怎么回事?
  2. 我该如何解决?

解决方案

任务似乎是"groupwise-max"问题.这是一种方法,仅涉及2个查询"(内部查询称为派生表").

SELECT  x.OrderDate, x.Customer, b.OrderTime,
        x.MaxPrice, b.Bartender
    FROM  
    (
        SELECT  DATE(OrderTime) AS OrderDate,
                Customer,
                Max(Price) AS MaxPrice
            FROM  tbl
            WHERE  TIME(OrderTime) BETWEEN '15:00' AND '18:00'
            GROUP BY  OrderDate, Customer 
    ) AS x
    JOIN  tbl AS b
       ON  b.OrderDate = X.OrderDate
      AND  b.customer = x.Customer
      AND  b.Price = x.MaxPrice
    WHERE  TIME(b.OrderTime) BETWEEN '15:00' AND '18:00'
    ORDER BY  x.OrderDate, x.Customer

理想的索引:

INDEX(Customer, Price)

(没有充分的理由使用MyISAM.)

每天数十亿新行

这会增加新的皱纹.每天需要超过1TB的额外磁盘空间吗?

是否可以汇总数据?这里的目标是在新数据进入时添加摘要信息,而不必重新扫描数十亿的旧数据.通过 ,您还可以删除事实表上的所有二级索引.

规范化将有助于缩小表的大小,从而加快查询速度. BartenderCustomer是此类的主要候选者-前者可能是SMALLINT UNSIGNED(2个字节; 65K值),后者是MEDIUMINT UNSIGNED(3个字节,16M).这可能会使您当前显示的5列缩小50%.标准化后,您在许多操作上的速度可能会提高2倍.

最好通过分段"数据来完成规范化-将数据加载到临时表中,在其中进行规范化,对其进行汇总,然后然后复制到主Fact表中.

请参见 http://mysql.rjweb.org/doc.php/summarytables
http://mysql.rjweb.org/doc.php/staging_table

在回到优化一个查询的问题之前,我们需要查看模式,数据流,是否可以规范化事情,汇总表是否有效等等.我希望有一个答案"该查询主要在摘要表中进行摘要.有时这会导致10倍的加速.

This is a followup to my previous question

Optimizing query to get entire row where one field is the maximum for a group

I'll change the names from what I used there to make them a little more memorable, but these don't represent my actual use-case (so don't estimate the number of records from them).

I have a table with a schema like this:

OrderTime           DATETIME(6),
Customer            VARCHAR(50),
DrinkPrice          DECIMAL,
Bartender           VARCHAR(50),
TimeToPrepareDrink  TIME(6),
...

I'd like to extract the rows from the table representing each customer's most expensive drink order during happy hour (3 PM - 6 PM) each day. So for instance I'd want results like

Date   | Customer | OrderTime   | MaxPrice   | Bartender | ...
-------+----------+-------------+------------+-----------+-----
1/1/18 |  Alice   | 1/1/18 3:45 | 13.15      | Jane      | ...
1/1/18 |  Bob     | 1/1/18 5:12 |  9.08      | Jane      | ...
1/1/18 |  Carol   | 1/1/18 4:45 | 20.00      | Tarzan    | ...
1/2/18 |  Alice   | 1/2/18 3:45 | 13.15      | Jane      | ...
1/2/18 |  Bob     | 1/2/18 5:57 |  6.00      | Tarzan    | ...
1/2/18 |  Carol   | 1/2/18 3:13 |  6.00      | Tarzan    | ...
 ...

The table has an index on OrderTime, and contains tens of billions of records. (My customers are heavy drinkers).

Thanks to the previous question I'm able to extract this for a specific day pretty easily. I can do something like:

SELECT * FROM orders b
INNER JOIN (
    SELECT Customer, MAX(DrinkPrice) as MaxPrice
    FROM orders
    WHERE OrderTime >= '2018-01-01 15:00' 
      AND OrderTime <= '2018-01-01 18:00'
    GROUP BY Customer
) AS a
ON a.Customer = b.Customer
AND a.MaxPrice = b.DrinkPrice
WHERE b.OrderTime >= '2018-01-01 15:00'
  AND b.OrderTime <= '2018-01-01 18:00';

This query runs in less than a second. The explain plan looks like this:

+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| id| select_type | table      | type  | possible_keys | key        | ref                | Extra                                                  |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| 1 | PRIMARY     | b          | range | OrderTime     | OrderTime  | NULL               | Using index condition                                  |
| 1 | PRIMARY     | <derived2> | ref   | key0          | key0       | b.Customer,b.Price |                                                        |
| 2 | DERIVED     | orders     | range | OrderTime     | OrderTime  | NULL               | Using index condition; Using temporary; Using filesort |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+

I can also get the information about the relevant rows for my query:

SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
FROM
        orders
    INNER JOIN
        (SELECT '2018-01-01' AS Date 
         UNION
         SELECT '2018-01-02' AS Date) dates
WHERE   OrderTime >= TIMESTAMP(Date, '15:00:00')
AND OrderTime <= TIMESTAMP(Date, '18:00:00')
GROUP BY Date, Customer
 HAVING MaxPrice > 0;

This query also runs in less than a second. Here's how its explain plan looks:

+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
| id   | select_type  | table      | type | possible_keys | key  | ref  | Extra                                          |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
|    1 | PRIMARY      | <derived2> | ALL  | NULL          | NULL | NULL | Using temporary; Using filesort                |
|    1 | PRIMARY      | orders     | ALL  | OrderTime     | NULL | NULL | Range checked for each record (index map: 0x1) |
|    2 | DERIVED      | NULL       | NULL | NULL          | NULL | NULL | No tables used                                 |
|    3 | UNION        | NULL       | NULL | NULL          | NULL | NULL | No tables used                                 |
| NULL | UNION RESULT | <union2,3> | ALL  | NULL          | NULL | NULL |                                                |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+

The problem now is retrieving the remaining fields from the table. I tried adapting the trick from before, like so:

SELECT * FROM
        orders a
    INNER JOIN
        (SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
        FROM
                orders
            INNER JOIN
                (SELECT '2018-01-01' AS Date
                 UNION
                 SELECT '2018-01-02' AS Date) dates
        WHERE   OrderTime >= TIMESTAMP(Date, '15:00:00')
            AND OrderTime <= TIMESTAMP(Date, '18:00:00')
        GROUP BY Date, Customer
        HAVING MaxPrice > 0) b
    ON     a.OrderTime >= TIMESTAMP(b.Date, '15:00:00')
       AND a.OrderTime <= TIMESTAMP(b.Date, '18:00:00')
       AND a.Customer = b.Customer;

However, for reasons I don't understand, the database chooses to execute this in a way that takes forever. Explain plan:

+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
| id   | select_type  | table      | type | possible_keys | key  | ref        | Extra                                          |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
|    1 | PRIMARY      | a          | ALL  | OrderTime     | NULL | NULL       |                                                |
|    1 | PRIMARY      | <derived2> | ref  | key0          | key0 | a.Customer | Using where                                    |
|    2 | DERIVED      | <derived3> | ALL  | NULL          | NULL | NULL       | Using temporary; Using filesort                |
|    2 | DERIVED      | orders     | ALL  | OrderTime     | NULL | NULL       | Range checked for each record (index map: 0x1) |
|    3 | DERIVED      | NULL       | NULL | NULL          | NULL | NULL       | No tables used                                 |
|    4 | UNION        | NULL       | NULL | NULL          | NULL | NULL       | No tables used                                 |
| NULL | UNION RESULT | <union3,4> | ALL  | NULL          | NULL | NULL       |                                                |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+

Questions:

  1. What is going on here?
  2. How can I fix it?

解决方案

The task seems to be a "groupwise-max" problem. Here's one approach, involving only 2 'queries' (the inner one is called a "derived table").

SELECT  x.OrderDate, x.Customer, b.OrderTime,
        x.MaxPrice, b.Bartender
    FROM  
    (
        SELECT  DATE(OrderTime) AS OrderDate,
                Customer,
                Max(Price) AS MaxPrice
            FROM  tbl
            WHERE  TIME(OrderTime) BETWEEN '15:00' AND '18:00'
            GROUP BY  OrderDate, Customer 
    ) AS x
    JOIN  tbl AS b
       ON  b.OrderDate = X.OrderDate
      AND  b.customer = x.Customer
      AND  b.Price = x.MaxPrice
    WHERE  TIME(b.OrderTime) BETWEEN '15:00' AND '18:00'
    ORDER BY  x.OrderDate, x.Customer

Desirable index:

INDEX(Customer, Price)

(There's no good reason to be using MyISAM.)

Billions of new rows per day

This adds new wrinkles. That's upwards of a terabyte of additional disk space needed each and every day?

Is it possible to summarize the data? The goal here is to add summary info as the new data comes in, and never have to re-scan the billions of old data. This may also let you remove all the secondary indexes on the Fact table.

Normalization will help shrink the table size, hence speeding up the queries. Bartender and Customer are prime candidates for such -- perhaps a SMALLINT UNSIGNED (2 bytes; 65K values) for the former and MEDIUMINT UNSIGNED (3 bytes, 16M) for the latter. That would probably shrink by 50% the 5 columns you currently show. You may get a 2x speedup on many operations after normalizing.

Normalization is best done by 'staging' the data -- Load the data into a temporary table, normalize within it, summarize it, then copy into the main Fact table.

See http://mysql.rjweb.org/doc.php/summarytables
and http://mysql.rjweb.org/doc.php/staging_table

Before getting back to the question of optimizing the one query, we need to see the schema, the data flow, whether things can be normalized, whether summary tables can be effective, etc. I would hope to have the 'answer' for the query to be mostly digested in a summary table. Sometimes this leads to a 10x speedup.

这篇关于优化每天查看特定时间窗口的查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆