How to obtain the most recent row per type and perform calculations, depending on the row type?


Problem description


I need some help writing/optimizing a query that retrieves the latest version of each row by type and performs some calculations depending on the type. I think it would be best if I illustrate it with an example.

Given the following dataset:

+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| id    | event_type        | event_timestamp     | message_id  | sent_at             | status | rate     |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| 1     | create            | 2016-11-25 09:17:48 | 1           | 2016-11-25 09:17:48 | 0      | 0.500000 |
| 2     | status_update     | 2016-11-25 09:24:38 | 1           | 2016-11-25 09:28:49 | 1      | 0.500000 |
| 3     | create            | 2016-11-25 09:47:48 | 2           | 2016-11-25 09:47:48 | 0      | 0.500000 |
| 4     | status_update     | 2016-11-25 09:54:38 | 2           | 2016-11-25 09:48:49 | 1      | 0.500000 |
| 5     | rate_update       | 2016-11-25 09:55:07 | 2           | 2016-11-25 09:50:07 | 0      | 1.000000 |
| 6     | create            | 2016-11-26 09:17:48 | 3           | 2016-11-26 09:17:48 | 0      | 0.500000 |
| 7     | create            | 2016-11-27 09:17:48 | 4           | 2016-11-27 09:17:48 | 0      | 0.500000 |
| 8     | rate_update       | 2016-11-27 09:55:07 | 4           | 2016-11-27 09:50:07 | 0      | 2.000000 |
| 9     | rate_update       | 2016-11-27 09:55:07 | 2           | 2016-11-25 09:55:07 | 0      | 2.000000 |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+

The expected result should be:

+------------+--------------------+--------------------+-----------------------+
| sent_at    | sum(submitted_msg) | sum(delivered_msg) | sum(rate_total)       |
+------------+--------------------+--------------------+-----------------------+
| 2016-11-25 |                  2 |                  2 |              2.500000 |
| 2016-11-26 |                  1 |                  0 |              0.500000 |
| 2016-11-27 |                  1 |                  0 |              2.000000 |
+------------+--------------------+--------------------+-----------------------+

At the end of the post is the query that is used to obtain this result. I'm willing to bet that there should be a way to optimize it, since it's using subqueries with joins, and from what I have read about BigQuery, joins should best be avoided. But first some background:

In essence, the dataset represents an append-only table, to which multiple events are written. The size of the data is in the hundreds of millions of rows and will grow to billions+. Since updates in BigQuery are not practical, and the data is being streamed to BQ, I need a way to retrieve the most recent event of each type, perform some calculations based on certain conditions and return an accurate result. The query is generated dynamically, based on user input, so more fields/calculations can be included, but they have been omitted for simplicity.

  • There is only one create event per message, but there can be n events of any other kind.
  • For each group of events, only the latest should be taken into account when doing the calculations.
    • status_update - updates the status
    • rate_update - updates the rate
    • create - self-explanatory
  • Every event other than create may not carry the rest of the original's information, or that information may not be accurate (except for message_id and the field that the event operates on). The dataset is simplified, but imagine there are many more columns, and more event types will be added later.
    • E.g. a rate_update may or may not have the status field set, or may carry a status that is not the final one, so no calculation can be made on the status field from a rate_update event; the same goes for the rate field on a status_update event.
  • It can be assumed that the table is partitioned by date and that every query will make use of the partitions. Those conditions have been omitted for simplicity.
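
To make the rules above concrete, here is a minimal reference sketch in plain Python (not BigQuery) that applies them to the sample dataset. The function names `latest` and `daily_totals` are hypothetical, and attributing each message to the day of its create event's sent_at is an assumption that happens to hold for the sample rows.

```python
from collections import defaultdict

# Sample rows from the dataset above:
# (event_type, event_timestamp, message_id, sent_at, status, rate)
EVENTS = [
    ("create",        "2016-11-25 09:17:48", 1, "2016-11-25 09:17:48", 0, 0.5),
    ("status_update", "2016-11-25 09:24:38", 1, "2016-11-25 09:28:49", 1, 0.5),
    ("create",        "2016-11-25 09:47:48", 2, "2016-11-25 09:47:48", 0, 0.5),
    ("status_update", "2016-11-25 09:54:38", 2, "2016-11-25 09:48:49", 1, 0.5),
    ("rate_update",   "2016-11-25 09:55:07", 2, "2016-11-25 09:50:07", 0, 1.0),
    ("create",        "2016-11-26 09:17:48", 3, "2016-11-26 09:17:48", 0, 0.5),
    ("create",        "2016-11-27 09:17:48", 4, "2016-11-27 09:17:48", 0, 0.5),
    ("rate_update",   "2016-11-27 09:55:07", 4, "2016-11-27 09:50:07", 0, 2.0),
    ("rate_update",   "2016-11-27 09:55:07", 2, "2016-11-25 09:55:07", 0, 2.0),
]


def latest(events, types):
    """Return the most recent event whose type is in `types`, or None."""
    matching = [e for e in events if e[0] in types]
    # Fixed-format timestamps compare correctly as plain strings.
    return max(matching, key=lambda e: e[1]) if matching else None


def daily_totals(events):
    """Aggregate per day: [submitted, delivered, rate_total]."""
    by_message = defaultdict(list)
    for e in events:
        by_message[e[2]].append(e)

    totals = defaultdict(lambda: [0, 0, 0.0])
    for msg_events in by_message.values():
        create = latest(msg_events, {"create"})
        status = latest(msg_events, {"status_update"})
        rate = latest(msg_events, {"rate_update", "create"})
        # Assumption: the reporting day comes from the create event's sent_at.
        day = create[3][:10]
        totals[day][0] += 1 if create[4] == 0 else 0
        totals[day][1] += 1 if status is not None and status[4] == 1 else 0
        totals[day][2] += rate[5]
    return dict(totals)


for day, row in sorted(daily_totals(EVENTS).items()):
    print(day, row)
```

On the sample rows this yields the same three daily totals as the expected result table, which makes it usable as a small oracle when comparing query variants.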

So I guess I have a couple of questions:

  • How can this query be optimized?
  • Would it be better to put the events other than create in their own tables, containing only the fields relevant to those events and the ones needed for the joins (message_id, event_timestamp)? Would this reduce the amount of data processed?
  • What would be the best way to add more events in the future, each with its own conditions and calculations?

Actually, any advice on how to query this dataset efficiently and conveniently is more than welcome! Thank you! :)

The monstrosity I've come up with is the following. The INNER JOINs are used to retrieve the latest version of each row, as per this resource (an article on querying append-only tables):

    select
    sent_at as sent_at,
    sum(submitted_msg) as submitted,
    sum(delivered_msg) as delivered,
    sum(sales_rate_total) as sales_rate_total
    FROM (

      #DELIVERED
        SELECT 
            d.message_id,
            FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
            0 as submitted_msg,
            sum(if(status=1,1,0)) as delivered_msg,
            0 as sales_rate_total
        FROM `events` d
        INNER JOIN
                (
                    select message_id, max(event_timestamp) as ts 
                    from `events` 
                    where event_type = "status_update" 
                    group by 1
                    ) g on d.message_id = g.message_id and d.event_timestamp = g.ts
        GROUP BY 1,2

        UNION ALL

      #SALES RATE
        SELECT 
            s.message_id,
            FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
            0 as submitted_msg,
            0 as delivered_msg,
            sum(rate) as sales_rate_total
        FROM `events` s
        INNER JOIN 
                    (
                    select message_id, max(event_timestamp) as ts 
                    from `events` 
                    where event_type in ("rate_update", "create")  
                    group by 1
                    ) f on s.message_id = f.message_id and s.event_timestamp = f.ts
        GROUP BY 1,2

        UNION ALL

      #SUBMITTED & REST
        SELECT 
            r.message_id,
            FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
            sum(if(status=0,1,0)) as submitted_msg,
            0 as delivered_msg,
            0 as sales_rate_total
        FROM `events` r
        INNER JOIN
                (
                    select message_id, max(event_timestamp) as ts 
                    from `events` 
                    where event_type = "create" 
                    group by 1
                    ) e on r.message_id = e.message_id and r.event_timestamp = e.ts
        GROUP BY 1, 2

    ) k
    group by 1

Solution

How can this query be optimized?

Try the version below:

#standardSQL
WITH types AS (
  SELECT 
    FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at,
    message_id,
    FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "create") DESC, event_timestamp DESC) AS submitted_status,
    FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "status_update") DESC, event_timestamp DESC) AS delivered_status,
    FIRST_VALUE(rate) OVER(PARTITION BY message_id ORDER BY (event_type IN ("rate_update", "create")) DESC, event_timestamp DESC) AS sales_rate
  FROM events
), latest AS (
  SELECT 
    sent_at,
    message_id,
    ANY_VALUE(IF(submitted_status=0,1,0)) AS submitted,  
    ANY_VALUE(IF(delivered_status=1,1,0)) AS delivered,  
    ANY_VALUE(sales_rate) AS sales_rate
  FROM types
  GROUP BY 1, 2
)
SELECT   
  sent_at,
  SUM(submitted) AS submitted,  
  SUM(delivered) AS delivered,  
  SUM(sales_rate) AS sales_rate_total        
FROM latest
GROUP BY 1
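
The core of the query above is the ordering inside each window: rows of the preferred event type sort first (the boolean expression, descending), and within them the newest timestamp wins. The following plain-Python sketch emulates which row FIRST_VALUE would read; `first_row` is a hypothetical helper and the rows are taken from messages 2 and 4 of the sample data.

```python
def first_row(rows, preferred_types):
    """Emulate the row FIRST_VALUE reads under
    ORDER BY (event_type IN preferred_types) DESC, event_timestamp DESC."""
    # Tuples compare element by element: True > False, then later timestamp.
    return max(
        rows,
        key=lambda r: (r["event_type"] in preferred_types, r["event_timestamp"]),
    )


message_2 = [
    {"event_type": "create", "event_timestamp": "2016-11-25 09:47:48", "status": 0, "rate": 0.5},
    {"event_type": "status_update", "event_timestamp": "2016-11-25 09:54:38", "status": 1, "rate": 0.5},
    {"event_type": "rate_update", "event_timestamp": "2016-11-25 09:55:07", "status": 0, "rate": 1.0},
    {"event_type": "rate_update", "event_timestamp": "2016-11-27 09:55:07", "status": 0, "rate": 2.0},
]
message_4 = [
    {"event_type": "create", "event_timestamp": "2016-11-27 09:17:48", "status": 0, "rate": 0.5},
    {"event_type": "rate_update", "event_timestamp": "2016-11-27 09:55:07", "status": 0, "rate": 2.0},
]

# Message 2 has a status_update, so its status is the one that is read.
print(first_row(message_2, {"status_update"})["status"])
# The latest rate comes from the newest rate_update/create row.
print(first_row(message_2, {"rate_update", "create"})["rate"])
# Message 4 has no status_update: the ordering falls back to the newest row of any type.
print(first_row(message_4, {"status_update"})["event_type"])
```

The last call shows a property worth keeping in mind: when a message has no event of the preferred type, the value is read from its newest event of any other type. In the sample data that fallback status is 0, so the delivered count is unaffected, but since the question notes that status on a rate_update may not be reliable, an explicit check that a status_update exists may be worth adding for real data.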

It's compact enough to manage easily, with no redundancy and no joins at all.
If your table is partitioned, you can make use of that by adjusting the query in just one place.
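
Since the question mentions that the query is generated dynamically and that more event types will be added later, one possible way to keep that manageable is to describe each derived column as data and generate the first CTE from it. This is only a sketch: `COLUMN_SPECS`, `first_value_column` and `build_types_cte` are hypothetical names, and the optional `where` argument is an assumed hook for a partition filter.

```python
# Hypothetical spec: (output alias, source column, event types that may set it).
COLUMN_SPECS = [
    ("submitted_status", "status", ["create"]),
    ("delivered_status", "status", ["status_update"]),
    ("sales_rate", "rate", ["rate_update", "create"]),
]


def first_value_column(alias, source_column, event_types):
    """Build one FIRST_VALUE(...) column in the style of the query above."""
    quoted = ", ".join(f'"{t}"' for t in event_types)
    return (
        f"FIRST_VALUE({source_column}) OVER("
        f"PARTITION BY message_id "
        f"ORDER BY (event_type IN ({quoted})) DESC, event_timestamp DESC"
        f") AS {alias}"
    )


def build_types_cte(specs, table="events", where=None):
    """Render the `types` CTE; names must come from a trusted allowlist,
    because they are interpolated into the SQL text without escaping."""
    select_list = [
        "FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at",
        "message_id",
    ] + [first_value_column(*spec) for spec in specs]
    lines = ["WITH types AS (", "  SELECT"]
    lines.append(",\n".join(f"    {item}" for item in select_list))
    lines.append(f"  FROM {table}")
    if where:
        # The single place where a partition filter would be added.
        lines.append(f"  WHERE {where}")
    lines.append(")")
    return "\n".join(lines)


sql = build_types_cte(COLUMN_SPECS)
print(sql)
```

Adding a new event type then means appending one entry to the spec rather than editing the SQL by hand. Note that a partition filter restricts which events each window can see, so it should be wide enough to include every event of the messages being reported.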

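You can use the dummy data below if you want to check the query above on a low volume first. To do so, put this CTE in front of the query, so that it starts with WITH events AS (...), types AS (...):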

WITH events AS (
  SELECT 1 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:17:48' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 2 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:24:38' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:28:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
  SELECT 3 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:47:48' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:47:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 4 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:54:38' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:48:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
  SELECT 5 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-25 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:50:07' AS sent_at, 0 AS status, 1.000000 AS rate UNION ALL
  SELECT 6 AS id, 'create' AS event_type, TIMESTAMP '2016-11-26 09:17:48' AS event_timestamp, 3 AS message_id, TIMESTAMP '2016-11-26 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 7 AS id, 'create' AS event_type, TIMESTAMP '2016-11-27 09:17:48' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
  SELECT 8 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:50:07' AS sent_at, 0 AS status, 2.000000 AS rate UNION ALL
  SELECT 9 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:55:07' AS sent_at, 0 AS status, 2.000000 AS rate 
)
