BigQuery - querying only a subset of keys in a table with a key-value schema

Question

I have a table with the following schema:

timestamp: TIMESTAMP
key: STRING
value: FLOAT

There are around 200 unique keys. The data is partitioned by date.

I want to run several queries on this table every day (5-6 currently, but I expect to add at least 15 more). Brute-forcing these would cost me a lot each day, which I want to avoid.

The issue is that, because of this key-value format and because BigQuery is a columnar database, each query scans the whole day's data, even though each query uses at most 4 keys. What is the best way to optimize this?

The best approach I can think of right now is a daily batch process that creates a separate temp table for each key, runs my queries against those tables, and then deletes them.

Ideally I would partition by key, but I am not sure whether there is any such provision.

Recommended answer

You can try the recently introduced clustering of partitioned tables:

When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
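For the schema in the question, this could look like the sketch below. It only builds the SQL text: a DDL statement for a table partitioned by date and clustered by `key`, plus a query that filters on both. The table name `mydataset.metrics`, the helper names, and the sample keys are hypothetical, and `FLOAT64` is used as the standard SQL spelling of the `FLOAT` type in the question.

```python
from typing import Sequence


def clustered_table_ddl(table: str) -> str:
    """Build DDL for a date-partitioned table clustered by `key`.

    `table` is a hypothetical "dataset.table" name, not one from the question.
    """
    return (
        f"CREATE TABLE `{table}` (\n"
        "  `timestamp` TIMESTAMP,\n"
        "  `key` STRING,\n"
        "  `value` FLOAT64\n"
        ")\n"
        "PARTITION BY DATE(`timestamp`)\n"
        "CLUSTER BY `key`"
    )


def key_subset_query(table: str, day: str, keys: Sequence[str]) -> str:
    """Build a query that filters on the partition date and the clustering column.

    Values are interpolated directly to keep the sketch short; real code
    should pass them as query parameters instead.
    """
    quoted = ", ".join(f"'{k}'" for k in keys)
    return (
        "SELECT `timestamp`, `key`, `value`\n"
        f"FROM `{table}`\n"
        f"WHERE DATE(`timestamp`) = '{day}'\n"
        f"  AND `key` IN ({quoted})"
    )


if __name__ == "__main__":
    print(clustered_table_ddl("mydataset.metrics"))
    print()
    print(key_subset_query("mydataset.metrics", "2019-01-01", ["cpu", "mem"]))
```

The resulting strings can be run in the BigQuery console or submitted through a client library. With a single clustering column, as here, column order is not a concern; it only matters once more columns are added to `CLUSTER BY`.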

Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
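To get an intuition for why sorted blocks help, here is a toy model in plain Python. It is not BigQuery's actual storage format or pruning logic, which the quoted text does not describe in detail. It assumes each block records the minimum and maximum key it contains and that a block is read only if a wanted key falls inside that range. The block size, key names, and row counts are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Row = Tuple[str, float]  # (key, value)


@dataclass
class Block:
    min_key: str
    max_key: str
    rows: List[Row]


def build_blocks(rows: Sequence[Row], block_size: int, clustered: bool) -> List[Block]:
    """Split rows into fixed-size blocks, sorting by key first if clustered."""
    ordered = sorted(rows, key=lambda r: r[0]) if clustered else list(rows)
    blocks = []
    for start in range(0, len(ordered), block_size):
        chunk = ordered[start:start + block_size]
        keys = [k for k, _ in chunk]
        blocks.append(Block(min(keys), max(keys), chunk))
    return blocks


def rows_scanned(blocks: Sequence[Block], wanted: Sequence[str]) -> int:
    """Count rows in blocks whose [min_key, max_key] range may hold a wanted key."""
    return sum(
        len(b.rows)
        for b in blocks
        if any(b.min_key <= k <= b.max_key for k in wanted)
    )


# One day's data: 200 keys, 10 readings each, arriving interleaved over time.
rows = [(f"key{k:03d}", float(t)) for t in range(10) for k in range(200)]
wanted = ["key003", "key004", "key005", "key006"]  # a query using 4 keys

unclustered = build_blocks(rows, block_size=200, clustered=False)
clustered = build_blocks(rows, block_size=200, clustered=True)

print(rows_scanned(unclustered, wanted))  # 2000: every block spans all keys
print(rows_scanned(clustered, wanted))    # 200: only the first block qualifies
```

In this toy setup the unclustered layout forces a read of the whole day, which matches the situation in the question, while the clustered layout touches only the block holding the four wanted keys.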

Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.


Update (moved from comments)

Also keep the following in mind:

Feature          Partitioning   Clustering
---------------  -------------  -------------
Cardinality      Less than 10k  Unlimited    
Dry Run Pricing  Available      Not available    
Query Pricing    Exact          Best Effort  

Pay special attention to Dry Run Pricing. Unfortunately, clustered tables do not support dry run (validation) based on clustering keys; the dry run shows only the estimate based on partitions. If you set up your clustering properly, though, the actual run will end up costing less. Try it with a smaller amount of data first to get comfortable with this.

