Google BigQuery pricing


Problem description

I'm a PhD student at Singapore Management University. Currently I'm working at Carnegie Mellon University on a research project that needs the historical events from Github Archive (http://www.githubarchive.org/). I noticed that Google BigQuery has the Github Archive data, so I ran a program to crawl the data using the Google BigQuery service.

I just found that the price Google BigQuery shows on the console is not updated in real time... When I had been running the program for a few hours, the fee was only 4 dollars plus, so I thought the price was reasonable and I kept the program running. After 1~2 days, I checked the price again on Sep 13, 2013, and it had become $1388... I therefore immediately stopped using the Google BigQuery service. And just now I checked the price again; it turns out I need to pay $4179...



It is my fault that I didn't realize I would need to pay this large amount of money for executing queries and obtaining data from Google BigQuery.

This project is only for research, not for commercial purposes. I would like to know whether it is possible to waive the fee. I really need the [Google BigQuery team]'s kind help.



Thank you very much & best regards,
Lisa

Solution



A year later update:

Please note some big developments since this situation:


  • Query prices are down 85%.

  • GithubArchive now publishes daily and yearly tables, so always test your queries on smaller datasets while developing them.






BigQuery pricing is based on the amount of data queried. One of its highlights is how easily it scales, going from scanning a few gigabytes to terabytes in just a few seconds.

Linear pricing is a feature: most (or all?) other databases I know of would require exponentially more expensive resources, or are simply unable to handle these amounts of data, at least not in a reasonable time frame.

That said, linear scaling means that a query over a terabyte is 1000 times more expensive than a query over a gigabyte. BigQuery users need to be aware of this and plan accordingly. For this purpose BigQuery offers the "dry run" flag, which lets you see exactly how much data a query will scan before running it, and adjust accordingly. In this case WeiGong was querying a 105 GB table. Ten SELECT * LIMIT 10 queries will quickly amount to a terabyte of data, and so on.
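That planning step can be sketched in Python. This is a minimal sketch, assuming an illustrative on-demand price of $5 per TB (check the current price list) and, for the optional live check, the `google-cloud-bigquery` client package:

```python
import os


def estimated_cost_usd(bytes_processed: int, price_per_tb: float = 5.0) -> float:
    """On-demand cost scales linearly with the bytes a query scans.

    price_per_tb is an illustrative assumption; check current BigQuery pricing.
    """
    return bytes_processed / 1024 ** 4 * price_per_tb


# Linear pricing: a 1 TB scan costs exactly 1024x a 1 GB scan.
print(estimated_cost_usd(105 * 1024 ** 3))  # the 105 GB table in this question

# Optional: ask BigQuery itself via a dry run (needs credentials configured).
if os.environ.get("RUN_BIGQUERY_DRY_RUN"):
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False,
                                     use_legacy_sql=True)
    job = client.query("SELECT * FROM [githubarchive:github.timeline] LIMIT 10",
                       job_config=config)
    print(f"would scan {job.total_bytes_processed} bytes, "
          f"~${estimated_cost_usd(job.total_bytes_processed):.2f}")
```

A dry-run job reports `total_bytes_processed` without executing the query or incurring charges, so it is safe to run before any large query.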



There are ways to make these same queries consume much less data:




  • Instead of querying SELECT * LIMIT 10, select only the columns you are looking for. BigQuery charges based on the columns you query, so unnecessary columns add unnecessary costs.

    For example, SELECT * ... queries 105 GB, while SELECT repository_url, repository_name, payload_ref_type, payload_pull_request_deletions FROM [githubarchive:github.timeline] only goes through 8.72 GB, making the query more than 10 times cheaper.




    • Instead of SELECT *, use tabledata.list when you want to download the whole table. It's free.

    • The Github Archive table contains data for all time. Partition it if you only want to look at one month of data.




    For example, extracting all of the January data with a query leaves a new table of only 91.7 MB. Querying that table is a thousand times cheaper than querying the big one!

    SELECT *
    FROM [githubarchive:github.timeline]
    WHERE created_at BETWEEN '2014-01-01' AND '2014-01-02'
    -> save this into a new table 'timeline_201401'

    Combining these methods, you can go from a $4,000 bill to a $4 one, for the same amount of quick and insightful results.
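    The savings quoted in this answer can be sanity-checked with a few lines of arithmetic (the GB/MB figures are the ones given above; the exact ratios are illustrative):

    ```python
    full_scan_gb = 105.0      # SELECT * over the full timeline table
    columns_scan_gb = 8.72    # selecting only the four needed columns
    january_mb = 91.7         # the one-month extract table

    # Cost is linear in bytes scanned, so savings are simple ratios.
    column_saving = full_scan_gb / columns_scan_gb
    partition_saving = full_scan_gb * 1024 / january_mb

    print(f"column selection: ~{column_saving:.0f}x cheaper")      # roughly 12x
    print(f"monthly partition: ~{partition_saving:.0f}x cheaper")  # roughly 1000x
    ```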

    (I'm working with the Github Archive's owner to get them to store monthly data instead of one monolithic table, to make this even easier.)


