BigQuery-时间序列和选择“最新"记录的最有效方法 [英] BigQuery - Time Series and most efficient way to select the 'latest' record

查看：57 发布时间：2020/11/17 6:27:42 google-bigquery

本文介绍了BigQuery-时间序列和选择“最新"记录的最有效方法的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

在我们的BQ设计中，我们有一个客户表(嵌套的原始数据)，该表是从我们的微服务层(消耗运动学蒸汽)派生的事件，其中每个事件都有该事件所针对实体的最新实体快照(后处理)更改后的图片).我猜是一种现代的变更数据捕获.

In our BQ design we have a customer table (nested raw data) which is event sourced from our microservices layer (consuming of kinesis steams), where each event has the latest entity snapshot for the entity the event is for (post processing image after the change). A sort of modern change data capture I guess.

每个事件中的最新快照是我们填充BigQuery的方式-提取该快照并将其加载到biqQuery中(通过apache spark结构化连接器通过APPEND ONLY模式). (这与更改和更新给定ID的一行不同)

This latest snapshot in each event is how we populate BigQuery - it is extracted and loaded into biqQuery (via apache spark structured steaming connector) in an APPEND ONLY mode. (This is different to say mutating and updating one row for a given ID)

因此，假设仅添加此表，则该表的大小当然会随着时间的推移而增长-事件中每次更改的条目.但是，它很好地是一个完整的，按时间顺序排列的客户状态和变更(我们需要具备)的序列，因此它是不可变的.我们可以通过重播事件来重建整个仓库，例如……在上下文中足够.

So given this is append only, the size of this table can grow of course over time - an entry per change from an event. However it quite nicely is a full, timeseries of customer state and changes (our requirement to have), and is immutable as such. We can rebuild the full warehouse by replaying the events for example....enough on the context.

这样做的一个后果是，将事实加载到BigQuery中可能会导致重复(例如，如果发生火花错误并重试一个微型批次，则按作业加载时BQ并不是幂等的结构化流宿，或者仅仅是由于其分布式性质而导致的) ). SteamingInserts可能会在以后研究，因为它有助于重复数据删除.....

One consequence of this is the fact that loading into BigQuery may result in duplicates (e.g if spark error and retries a micro batch, BQ isnt an idempotent structured streaming sink when loading by jobs, or just due to distributed nature its generally possible). SteamingInserts might be something to look into later as it helps with deduping.....

这种体系结构的结果是，我需要一个原始时间序列数据(记住有时可能有重复项)的视图，这些视图在这些条件下会返回LATEST记录.

The result of this architecture is that I need a view ontop of the raw time series data (remember can occasionally have duplicates) that returns the LATEST record under these conditions.

最新情况由客户记录(metadata.lastUpdated)上的元数据结构字段确定-带有MAX(metadata.lastUpdated)的行是最新行.这是我们的MS层所保证的.

Latest is determined by a metadata struct field on the customer record (metadata.lastUpdated) - and the row with the MAX(metadata.lastUpdated) is the latest. This is guarneteed by our MS layer.

这也是一个真实的事件时间时间戳.表ID DAY已分区，并具有_PARTITIONTIME列，但这只是摄取时间，我不能使用它.当我可以指定要用作分区时间的列时，那就太好了！ (愿望清单).

This is a true event time timestamp as well. The table id DAY partitioned and has a _PARTITIONTIME column, but this is just an ingest time and I cant use this. Be great when I can specify a column to be used as the partition time! (wishlist).

重复项将是两行，具有相同的客户'id'和'metadata.lastUpdated'-因此MAX(metadata.lastUpdated)可能返回2行，所以我需要使用

A duplicate will be two rows with the SAME customer 'id' AND 'metadata.lastUpdated' - so MAX(metadata.lastUpdated) could return 2 rows, so I need to use

ROW_NUMBER()OVER(PARTITION BY ....所以我可以选择rowNum = 1

ROW_NUMBER() OVER (PARTITION BY .... so I can select the rowNum=1

在我看来，也只能选择1个出现重复的行.

In my view as well to only select 1 row where there is dups.

确定足够的单词/上下文(抱歉)，下面是我的SQL以获得最新的视图.它可以通过我的测试运行，但是当表的大小/行数变大时，我不确定这是实现结果的最有效方法，并且想知道是否有任何BigQuery棺材可能更有效/更聪明SQL要这样做吗?为什么说SQL没问题，但绝不是性能调优的专家，可以肯定，尤其是为BQ性能调优做SQL的最佳方法.

Ok so enough words/context (sorry), below is my view SQL to get the latest. It works from my tests, but I am not sure it is the most efficient way to achieve my outcome when the size of the table / number of rows gets large, and was wondering if any BigQuery boffins out there might have a more efficient / clever SQL to do this? Why SQL is OK, but by no means an expert in performance tuning for sure and in particular the best ways to do SQL for BQ perf tunning.

我只是希望能够将所有数据保存在一个表中，并依靠dremel引擎的功能来查询它，而不是需要多个表或执行任何过于复杂的操作.

I was just hoping to be able to have all the data in one table and rely on the power of the dremel engine to just query it, rather then needing to have multiple tables or do anyting too complex.

所以我的SQL在下面.注意-我的时间戳记是作为字符串提取的，因此也需要在视图中对其进行解析.

SO my SQL is below. Note - my timestamp is ingested as a string, so need to PARSE this in the view too.

WITH
  cus_latest_watermark AS (
  SELECT
    id,
    MAX(PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", metadata.lastUpdated)) AS maxLastUpdatedTimestampUTC
  FROM
    `project.dataset.customer_refdata`
  GROUP BY
    id ),
  cust_latest_rec_dup AS (
  SELECT
    cus.*,
    ROW_NUMBER() OVER (PARTITION BY cus.id ORDER BY cus.id) AS rowNum
  FROM
    `project.dataset.customer_refdata` cus
  JOIN
    cus_latest_watermark
  ON
    cus.id = cus_latest_watermark.id
    AND PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*S%Ez", cus.metadata.lastUpdated) = cus_latest_watermark.maxLastUpdatedTimestampUTC)
SELECT
  cust_latest_rec_dup.* EXCEPT(rowNum)
FROM
  cust_latest_rec_dup
WHERE
  rowNum = 1

谢谢！

BigQuery-时间序列和选择“最新"记录的最有效方法 [英] BigQuery - Time Series and most efficient way to select the 'latest' record

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

BigQuery-时间序列和选择“最新"记录的最有效方法 [英] BigQuery - Time Series and most efficient way to select the &#39;latest&#39; record

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

BigQuery-时间序列和选择“最新"记录的最有效方法 [英] BigQuery - Time Series and most efficient way to select the 'latest' record

登录关闭