BigQuery: Do clustered tables remain sorted in the face of streaming inserts?

Question

I have hourly batch jobs that need to scan all the data that has streamed into my table in the last hour. Right now I'm using a date-partitioned table, which means that every time I scan a date partition for an hour's worth of data, I have to scan rows from all hours of that day.
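
To make the cost concrete, here is a minimal sketch of the kind of hourly query described above. The table name `my_dataset.events` and the columns `event_date` (the partitioning column) and `event_ts` are hypothetical, not taken from my real schema. The `event_date` filter prunes the scan to one daily partition, but the `event_ts` range on its own does not reduce the rows scanned inside that partition.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def previous_hour_window(now: datetime) -> Tuple[datetime, datetime]:
    """Return [start, end) for the last complete hour before `now`."""
    end = now.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=1), end


def hourly_scan_query(now: datetime) -> str:
    """Build the hourly scan query; `now` is assumed to be in UTC.

    The window is hour-aligned, so it always falls inside a single
    calendar date and therefore a single daily partition.
    """
    start, end = previous_hour_window(now)
    return (
        "SELECT *\n"
        "FROM `my_dataset.events`\n"  # hypothetical table name
        f"WHERE event_date = DATE '{start:%Y-%m-%d}'\n"
        f"  AND event_ts >= TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}+00'\n"
        f"  AND event_ts < TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}+00'"
    )


if __name__ == "__main__":
    print(hourly_scan_query(datetime(2019, 4, 17, 14, 5, tzinfo=timezone.utc)))
```

In a real job the timestamps would normally be passed as query parameters rather than formatted into the SQL text; they are inlined here only to keep the sketch short.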

I've been thinking about clustering this table on an hour field, however I'm under the impression that BigQuery won't actually keep the table effectively clustered in the face of streaming inserts. So here's my question:

Does BigQuery guarantee to keep clustered tables sorted even in the face of streaming inserts?
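
For reference, the layout I have in mind might be declared roughly as follows. This is only a sketch: the table name, the column names, and the `event_hour` INT64 column are all hypothetical, and the helper simply assembles a BigQuery DDL string. It says nothing about whether freshly streamed rows actually end up sorted by `event_hour`, which is exactly what I'm asking.

```python
def clustered_table_ddl(table: str, cluster_field: str = "event_hour") -> str:
    """Build DDL for a date-partitioned table clustered on an hour column.

    All names here are illustrative assumptions, not a real schema.
    """
    return (
        f"CREATE TABLE `{table}` (\n"
        "  event_ts TIMESTAMP,\n"
        f"  {cluster_field} INT64,\n"
        "  payload STRING\n"
        ")\n"
        "PARTITION BY DATE(event_ts)\n"
        f"CLUSTER BY {cluster_field}"
    )


if __name__ == "__main__":
    print(clustered_table_ddl("my_dataset.events_clustered"))
```

With this layout the hourly job would add a filter such as `event_hour = 13` on top of the date filter, hoping that clustering lets BigQuery skip the blocks for the other hours.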

Recommended answer

Currently the answer is no: clustered tables do not remain sorted/clustered in the face of streaming inserts. Many thanks to Tamir for pointing out that there's an answer relevant to this question here. Check out that answer for details, as well as a trick to force sorting on part of a partition.

It also looks like the BigQuery team is working on this. According to this issue tracker comment from April 17, 2019:

We are doing some a fair amount of work with streaming to keep the table clustered upto a certain recent time interval. We don't have a good ETA to offer on this at this point, but we hope to have more information on this soon.
