I want a "materialized view" of the latest records

Question

As I keep appending rows to BigQuery, I'd like to have a "materialized view" of the latest row for each id.

How can I do this?

Recommended answer

2018-10: BigQuery doesn't support materialized views, but you can use this approach:

Let's say you want a table with the latest info for each row, and you want to keep it updated, so that anyone querying can easily access the latest row without having to scan the whole append-only table.

For this example I'll use my Wikipedia clustered logs, and I'll create a table with the latest rows of all English pages that start with 'A'. These restrictions make my queries faster and smaller for the purposes of this demo.

Let's first create the table:

CREATE TABLE `wikipedia_vt.just_latest_rows` AS
SELECT latest_row.* 
FROM (
  SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
  FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
  WHERE datehour BETWEEN "2018-10-18" AND "2018-10-21" 
  AND wiki='en' AND title LIKE 'A%'
  GROUP BY title
)
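
A quick sanity check, sketched under the assumption that the CREATE TABLE above succeeded: since the query groups by title, the new table should hold exactly one row per title, so the two counts should match. The column aliases are my own.

```sql
# One row per title is expected, so row_count should equal distinct_titles.
SELECT
  COUNT(*) AS row_count,
  COUNT(DISTINCT title) AS distinct_titles,
  MAX(datehour) AS newest_datehour  # most recent hour captured so far
FROM `wikipedia_vt.just_latest_rows`
```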

And now I want to update it with all the new rows received since that date:

MERGE `wikipedia_vt.just_latest_rows` T
# our "materialized view"
USING  (
  SELECT latest_row.* 
  FROM (
    SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
    FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
    WHERE datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY )
    # change to CURRENT_TIMESTAMP() or let scheduled queries do it
    AND datehour > '2000-01-01' # nag
    AND wiki='en' AND title LIKE 'A%'
    GROUP BY title
  )
) S
ON T.title = S.title

WHEN MATCHED THEN
  # if the row is there, we update the views and time
  UPDATE SET views = S.views, datehour=S.datehour

WHEN NOT MATCHED BY TARGET THEN
  # if the row is not there, we insert it 
  INSERT (datehour, wiki, title, views) VALUES (datehour, wiki, title, views)

Now you should set up a process to run this query periodically. To keep querying costs down, make sure the process changes the starting date for updates.

A simple way to set up this process is to use the new BigQuery Scheduled Queries, which will replace @run_time with the current timestamp.
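
For a manual test run outside a scheduled query, the comment in the MERGE above suggests swapping @run_time for CURRENT_TIMESTAMP(). As a minimal sketch, assuming the same tables as above, this previews the rows the MERGE would use as its source without modifying anything:

```sql
# Preview only: the same source subquery as the MERGE, with @run_time
# replaced by CURRENT_TIMESTAMP() so it can be run by hand.
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
  FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
  WHERE datehour > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND datehour > '2000-01-01'
  AND wiki='en' AND title LIKE 'A%'
  GROUP BY title
)
```

With a one-day window this only returns rows if the source table has received data in the last day; widen the INTERVAL to match how often the update actually runs.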

To create a view that combines this approach with a real-time view of the latest records, see:
