Google BigQuery Price Optimization


Problem description




I'm using Google BigQuery to store user activity data from my app.

I created a table in BigQuery; let's call it "user_activities_data".

The data covers many kinds of user activity and is updated constantly: thousands of new rows are inserted into the table every second.

I run a lot of queries against this table.

I know that BigQuery charges by the amount of data each query processes. I have already tried to optimize my queries to process as little data as possible (by selecting only the columns I need).

For example:

SELECT username, activity FROM user_activities_data WHERE date_activities >= '2016-10-01' AND date_activities <= '2016-10-31'

But because the amount of data grows every day, the cost grows too, even though I run the same query with the same WHERE condition (like the example above).

So my question is: what is the best way to optimize the pricing?

Last month each query cost me around $0.2; this month, because my data has doubled, each query costs $0.4. I run hundreds of queries every day, so the bill can become huge if this is not managed well.
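
The billing arithmetic behind these numbers can be sketched in a few lines of Python. The $5-per-TB rate, the 40 GB scan size, and the 300 queries per day are assumptions chosen only so the result matches the $0.2 figure above; they are not taken from the BigQuery price list, so check the current pricing before relying on them.

```python
# Assumed on-demand rate in USD per TB scanned (hypothetical; check current pricing).
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 10 ** 12  # decimal TB, used here for simplicity


def query_cost_usd(bytes_processed: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimate the on-demand cost of one query from the bytes it scans."""
    return bytes_processed / BYTES_PER_TB * price_per_tb


def monthly_cost_usd(bytes_processed: int, queries_per_day: int, days: int = 30) -> float:
    """Estimate a month's bill when the same query runs many times a day."""
    return query_cost_usd(bytes_processed) * queries_per_day * days


last_month_scan = 40 * 10 ** 9          # 40 GB scanned per query (assumed)
this_month_scan = 2 * last_month_scan   # the table doubled, so the scan doubled

print(query_cost_usd(last_month_scan))
print(query_cost_usd(this_month_scan))
print(monthly_cost_usd(this_month_scan, queries_per_day=300))
```

The point of the sketch is that the cost depends on the bytes scanned, not on how many rows the WHERE clause finally returns, so a filter on an ordinary column does not reduce the bill.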

I have considered several options:

  1. Use the daily partition option in BigQuery. I don't know whether I set it up correctly, though, because the pricing (by amount of data processed) is still the same. (Did I do something wrong here?)

  2. Split the table into one table per month, like: user_data_activities_oct16, user_data_activities_nov16 and so on. Note: sometimes I need to query data that spans several months, for example user activities from August 31st 2016 to November 5th 2016. Is this a good solution?

I'm open to any suggestions :)
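
To make option 2 concrete, here is a minimal sketch of how a date range maps onto monthly tables. The helper name `monthly_tables` is hypothetical; the table prefix and the `oct16`-style suffix follow the names used in the question.

```python
from datetime import date
from typing import List

# English month abbreviations, spelled out so the result does not depend on locale.
MONTH_ABBR = ["jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec"]


def monthly_tables(start: date, end: date,
                   prefix: str = "user_data_activities_") -> List[str]:
    """Return the monthly table names that a query over [start, end] must read."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{MONTH_ABBR[month - 1]}{year % 100:02d}")
        # Advance to the month after the current one.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


tables = monthly_tables(date(2016, 8, 31), date(2016, 11, 5))
print(tables)
```

For the range in the question this yields four tables (August through November). A query over them still reads the whole August and November tables for the selected columns, even though only one day and five days of them are needed.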

Thank you very much

Solution

You need to use Partitioned Tables.

For example, create a table that is partitioned by day. You can then write each day's data into its own partition by addressing it as mydataset.table$20160519, and query it like this:

SELECT
  field1
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01")
  AND TIMESTAMP("2016-05-06")

As you can see, the pseudo column _PARTITIONTIME is how you restrict the query to specific partitions. The query is then billed only for the data in the partitions that fall within the given date interval, not for the whole table. This doesn't stop you from running full-table queries when you need them.
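
The effect of partition pruning on the bytes billed can be illustrated with a small model. This is only a sketch: the 1 GB partition size and the helper names `partition_decorator` and `bytes_scanned` are made up for illustration, and it models the billing rule described above rather than calling BigQuery.

```python
from datetime import date, timedelta
from typing import Dict


def partition_decorator(table: str, day: date) -> str:
    """Build the 'table$YYYYMMDD' name that addresses one daily partition."""
    return f"{table}${day:%Y%m%d}"


def bytes_scanned(partitions: Dict[date, int], start: date, end: date) -> int:
    """Sum the sizes of only those daily partitions inside [start, end]."""
    return sum(size for day, size in partitions.items() if start <= day <= end)


# Hypothetical table: 31 daily partitions of 1 GB each for May 2016.
ONE_GB = 10 ** 9
may_2016 = {date(2016, 5, 1) + timedelta(days=i): ONE_GB for i in range(31)}

# Same interval as the query above: May 1 to May 6, both ends included.
pruned = bytes_scanned(may_2016, date(2016, 5, 1), date(2016, 5, 6))
full = sum(may_2016.values())

print(partition_decorator("mydataset.table", date(2016, 5, 19)))
print(pruned, full)
```

In this model the filtered query touches 6 of the 31 partitions, so it is billed for roughly a fifth of what a full-table scan would cost, and that share stays bounded as new daily partitions are added.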
