I want a "materialized view" of the latest records
Question
As I keep appending rows to BigQuery, I'd like to have a "materialized view" of the latest row for each id.
How can I do that?
Recommended answer
2018-10: BigQuery doesn't support materialized views, but you can use this approach:
Let's say you want a table with the latest info for each row, and you want to keep it updated - so anyone querying can easily access the latest row without having to scan the whole append-only table.
For this example I'll use my Wikipedia clustered logs - and I'll create a table with the latest rows of all English pages that start with 'A'. These restrictions make my queries faster and smaller for the purposes of this demo.
Let's first create the table:
CREATE TABLE `wikipedia_vt.just_latest_rows` AS
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
  FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
  WHERE datehour BETWEEN "2018-10-18" AND "2018-10-21"
    AND wiki='en' AND title LIKE 'A%'
  GROUP BY title
)
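To make the `ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] ... GROUP BY title` pattern concrete, here is a minimal in-memory Python model of what it computes: one row per title, namely the row with the greatest `datehour`. This is only a sketch; the function name `latest_row_per_title` and the sample rows are made up for illustration and are not part of BigQuery or of the original answer.

```python
from datetime import datetime


def latest_row_per_title(rows):
    """Return {title: row}, keeping the row with the greatest datehour per title."""
    latest = {}
    for row in rows:
        current = latest.get(row["title"])
        # Keep this row if the title is new or this row is more recent.
        if current is None or row["datehour"] > current["datehour"]:
            latest[row["title"]] = row
    return latest


# Hypothetical sample of the append-only log.
rows = [
    {"datehour": datetime(2018, 10, 18, 1), "wiki": "en", "title": "Apple", "views": 10},
    {"datehour": datetime(2018, 10, 20, 5), "wiki": "en", "title": "Apple", "views": 25},
    {"datehour": datetime(2018, 10, 19, 3), "wiki": "en", "title": "Avocado", "views": 7},
]

snapshot = latest_row_per_title(rows)
```

Here `snapshot` plays the role of `just_latest_rows`: however many rows the log holds for a title, the snapshot holds exactly one.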
And now I want to update it with all the new rows received since that date:
MERGE `wikipedia_vt.just_latest_rows` T
# our "materialized view"
USING (
  SELECT latest_row.*
  FROM (
    SELECT ARRAY_AGG(a ORDER BY datehour DESC LIMIT 1)[OFFSET(0)] latest_row
    FROM `fh-bigquery.wikipedia_v3.pageviews_2018` a
    WHERE datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY)
      # change to CURRENT_TIMESTAMP() or let scheduled queries do it
      AND datehour > '2000-01-01' # nag
      AND wiki='en' AND title LIKE 'A%'
    GROUP BY title
  )
) S
ON T.title = S.title
WHEN MATCHED THEN
  # if the row is there, we update the views and time
  UPDATE SET views = S.views, datehour = S.datehour
WHEN NOT MATCHED BY TARGET THEN
  # if the row is not there, we insert it
  INSERT (datehour, wiki, title, views) VALUES (datehour, wiki, title, views)
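The two branches of the MERGE can be modeled in a few lines of Python, which may help when reasoning about what ends up in the table after each run. This is a rough sketch under assumed data, not how BigQuery executes the statement; `merge_latest`, `target`, and `source` are hypothetical names standing in for the MERGE, table `T`, and subquery `S`.

```python
from datetime import datetime


def merge_latest(target, source):
    """Upsert source rows into target, keyed by title (mirrors the MERGE branches)."""
    for title, row in source.items():
        if title in target:
            # WHEN MATCHED: update only views and datehour.
            target[title]["views"] = row["views"]
            target[title]["datehour"] = row["datehour"]
        else:
            # WHEN NOT MATCHED BY TARGET: insert the whole row.
            target[title] = dict(row)
    return target


# Hypothetical current contents of the "materialized view".
target = {
    "Apple": {"datehour": datetime(2018, 10, 20, 5), "wiki": "en", "title": "Apple", "views": 25},
}
# Hypothetical latest rows found in the most recent window of the log.
source = {
    "Apple": {"datehour": datetime(2018, 10, 22, 6), "wiki": "en", "title": "Apple", "views": 40},
    "Avocado": {"datehour": datetime(2018, 10, 22, 3), "wiki": "en", "title": "Avocado", "views": 9},
}

merge_latest(target, source)
```

After the call, `target` holds the refreshed "Apple" row plus a newly inserted "Avocado" row; titles with no new rows in the window are left untouched.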
Now you should set up a process to run this query periodically. To keep querying costs down, make sure the process changes the starting date for updates.
A simple way to set up this process is to use the new BigQuery Scheduled Queries, which will replace @run_time with the current timestamp.
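As a rough illustration of the `datehour > TIMESTAMP_SUB(@run_time, INTERVAL 1 DAY)` filter, the Python sketch below selects the rows a given run would look at. The helper `rows_in_window`, its `lookback` parameter, and the sample rows are assumptions made for this example. The point it shows is that the lookback has to be at least as long as the interval between runs, otherwise rows arriving between two runs would never be merged.

```python
from datetime import datetime, timedelta


def rows_in_window(rows, run_time, lookback=timedelta(days=1)):
    """Keep rows whose datehour is strictly after run_time - lookback."""
    start = run_time - lookback
    return [r for r in rows if r["datehour"] > start]


# Hypothetical log rows.
rows = [
    {"datehour": datetime(2018, 10, 20, 23), "title": "Apple", "views": 25},
    {"datehour": datetime(2018, 10, 21, 12), "title": "Apple", "views": 31},
    {"datehour": datetime(2018, 10, 22, 6), "title": "Avocado", "views": 9},
]

# Stand-in for the timestamp a scheduled run would substitute for @run_time.
run_time = datetime(2018, 10, 22, 8)
recent = rows_in_window(rows, run_time)
```

With a run at 2018-10-22 08:00 and a one-day lookback, only the rows after 2018-10-21 08:00 are considered, so each run reads a small slice of the log rather than the whole table.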
To create a view that combines this approach with a real-time view of the latest records, see: