Google BigQuery:检索每行的最新版本 [英] Google BigQuery: retrieve last version of each row

查看:94
本文介绍了Google BigQuery:检索每行的最新版本的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个Google BigQuery表,其中包含所有版本的资源.每次创建/更新/删除资源时,都会添加一个新行,以增加版本号(此数字将是添加行时的时间戳)

I have a Google BigQuery Table which contains all the versions of resources. Every time a resource is created/updated/deleted a new row is added incrementing the version number (this number will be the timestamp of when the row is added)

+-------+------------+--------+-------+-------------+
|  ID   | ResourceID | Action | Count |  Timestamp  |
+-------+------------+--------+-------+-------------+
| ABC_1 | ABC        | CREATE |    10 | {timestamp} |
| ABC_2 | ABC        | UPDATE |     8 | {timestamp} |
| ABC_3 | ABC        | UPDATE |     4 | {timestamp} |
| ABC_4 | ABC        | DELETE |     4 | {timestamp} |
| -     |            |        |       |             |
| DEF_1 | DEF        | CREATE |    10 | {timestamp} |
| DEF_2 | DEF        | DELETE |    10 | {timestamp} |
| -     |            |        |       |             |
| GHJ_1 | GHJ        | CREATE |    10 | {timestamp} |
| -     |            |        |       |             |
| KLM_1 | KLM        | CREATE |    10 | {timestamp} |
| KLM_2 | KLM        | UPDATE |     5 | {timestamp} |
+-------+------------+--------+-------+-------------+

  • ID :该行的唯一ID,其中包含ResourceID和版本标识符
  • ResourceID :发生操作的资源的ID
  • 动作:该动作发生在资源上
  • 计数:与资源关联的值
  • 时间戳:添加行的时间戳(与唯一ID相同)
    • ID: a unique ID of the row, which contains the ResourceID plus the version identifier
    • ResourceID: the ID of the resource where an action occured
    • Action: The action occured on the resource
    • Count: The value associated to the resource
    • Timestamp: The timestamp of when the row has been added (which is the same attached to the unique ID)
    • 我需要一个组合查询,以检索每种资源的所有最新版本

      I need a compose a query which retrieve all the last versions of each resource

      +-------+------------+--------+-------+-------------+
      |  ID   | ResourceID | Action | Count |  Timestamp  |
      +-------+------------+--------+-------+-------------+
      | ABC_4 | ABC        | DELETE |     4 | {timestamp} |
      | DEF_2 | DEF        | DELETE |    10 | {timestamp} |
      | GHJ_1 | GHJ        | CREATE |    10 | {timestamp} |
      | KLM_2 | KLM        | UPDATE |     5 | {timestamp} |
      +-------+------------+--------+-------+-------------+
      

      此外,所有处于DELETE状态的资源都需要忽略. 所以这是我想要的最终输出

      In addition, all the resource which are in DELETE status, need to be ignored. So here is the final output I'm looking for

      +-------+------------+--------+-------+-------------+
      |  ID   | ResourceID | Action | Count |  Timestamp  |
      +-------+------------+--------+-------+-------------+
      | GHJ_1 | GHJ        | CREATE |    10 | {timestamp} |
      | KLM_2 | KLM        | UPDATE |     5 | {timestamp} |
      +-------+------------+--------+-------+-------------+
      

      这是我进行的查询

      SELECT ResourceId, Count
      FROM worklog_*
      WHERE ID IN (
          SELECT max(ID)
          FROM worklog_*
          GROUP BY WorklogID
      ) AND Action != DELETE
      

      这不是真正的BigQuery查询,但足以了解其行为. 如果可以比较ID列的值,则此查询工作正常,这就是为什么我选择加入ResourceId和Timestamp的原因,MAX()值将始终提供最后一个状态

      It is not a true BigQuery query but it's enough to understand the behaviour. This query works fine if the values of the ID column can be compared, this is why I choose to join ResourceId and Timestamp, the MAX() value will always provide the last status

      这是最好的方法吗?没有人对采用这种提取方法有更好的建议吗?

      Is this the best approach? Does anynone have a suggestion on a better way to do this kind of extraction?

      推荐答案

      对于BigQuery标准SQL

      For BigQuery Standard SQL

      #standardSQL
      WITH worklog AS (
        SELECT 'ABC_1' AS ID, 'ABC' AS ResourceID, 'CREATE' AS Action, 10 AS COUNT UNION ALL
        SELECT 'ABC_2', 'ABC', 'UPDATE', 8 UNION ALL
        SELECT 'ABC_3', 'ABC', 'UPDATE', 4 UNION ALL
        SELECT 'ABC_4', 'ABC', 'DELETE', 4 UNION ALL
        SELECT 'DEF_1', 'DEF', 'CREATE', 10 UNION ALL
        SELECT 'DEF_2', 'DEF', 'DELETE', 10 UNION ALL
        SELECT 'GHJ_1', 'GHJ', 'CREATE', 10 UNION ALL
        SELECT 'KLM_1', 'KLM', 'CREATE', 10 UNION ALL
        SELECT 'KLM_2', 'KLM', 'UPDATE', 5 
      )
      SELECT * EXCEPT(Last)
      FROM (
        SELECT *,
          ROW_NUMBER() OVER(PARTITION BY ResourceID ORDER BY ID DESC) AS Last
        FROM worklog
        WHERE Action != 'DELETE'
      )
      WHERE Last = 1
      -- ORDER BY ID
      

      这篇关于Google BigQuery:检索每行的最新版本的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆