Joins on Google BigQuery


Question


I know that work is being done to improve the JOIN feature in BigQuery, and I don't mean to rant, but it will be hard to analyze 'terabyte' data sets as 'advertised' if joins cannot be used properly.

OK, back to the problem. I have two tables: one is 600 MB and the other is 50 MB. I tried to join them and got an error about the smaller table having to be on the left. I did some research and found that BigQuery apparently considers both tables big if they are larger than 7 MB?

So, based on some advice I found online, I selected the relevant data from the smaller table and saved it in a new table; the new data set is 12 MB and 600K rows over two columns. Then I tried my query again, but I am still getting the same error:

Query Failed : Error: Large table cdrs_test.geoIP_Left must appear as the leftmost table in a join query

geoIP_Left is 12 MB; the other table is 600 MB.

Can this be fixed somehow, or am I out of luck? If not, does anyone know of another service that supports joins and that I can use to analyze large data sets?

EDIT: This is the actual query:

SELECT COUNT(results.cc_card) AS count,
       sum(results.sessiontime) AS time, 
       geoIP_Left.place AS place 
FROM cdrs_test.cdrs_2010_5 AS results 
JOIN cdrs_test.geoIP_Left AS geoIP_table 
  ON results.cc_card = geoIP_table.vcard 
WHERE results.sessiontime > 0 AND results.countryName Contains 'India' 
GROUP BY place;

Solution

You could express this query with a subselect that references only the columns from cdrs_test.geoIP_Left that you are interested in, which should reduce the amount of data on the right-hand side of the join. See the second example here.

SELECT
  COUNT(results.cc_card) AS count,
  sum(results.sessiontime) AS time, 
  geoIP_table.place AS place 
FROM
  cdrs_test.cdrs_2010_5 AS results 
JOIN
  (SELECT place, vcard FROM cdrs_test.geoIP_Left)
AS
  geoIP_table 
ON
  results.cc_card = geoIP_table.vcard 
WHERE
  results.sessiontime > 0 AND results.countryName CONTAINS 'India' 
GROUP BY
  place;

You can also simply run multiple queries - note that you can save query results explicitly as a named table, and use that table in a later query.
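
To illustrate the multi-query pattern, here is a minimal sketch that saves an intermediate result as a named table and then joins against it in a second query. It uses Python's built-in sqlite3 module purely as a local stand-in, not BigQuery, so the syntax (`CREATE TABLE ... AS`, `LIKE` instead of `CONTAINS`) differs from what you would write there. The table names `cdrs`, `geoip`, and `cdrs_india`, the `extra` column, and the sample rows are all made up for the example.

```python
import sqlite3

# sqlite3 is only a local stand-in here; BigQuery itself is not involved.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cdrs (cc_card TEXT, sessiontime INTEGER, countryName TEXT);
    CREATE TABLE geoip (vcard TEXT, place TEXT, extra TEXT);
    INSERT INTO cdrs VALUES
        ('A', 10, 'India'), ('A', 5, 'India'), ('B', 7, 'India'),
        ('B', 0, 'India'), ('C', 3, 'Brazil');
    INSERT INTO geoip VALUES
        ('A', 'Mumbai', 'x'), ('B', 'Delhi', 'y'), ('C', 'Rio', 'z');
""")

# Query 1: save the filtered call records as a named table (hypothetical name).
conn.execute("""
    CREATE TABLE cdrs_india AS
    SELECT cc_card, sessiontime
    FROM cdrs
    WHERE sessiontime > 0 AND countryName LIKE '%India%'
""")

# Query 2: a later query joins against the saved table instead of the raw one.
rows = conn.execute("""
    SELECT COUNT(r.cc_card) AS n_calls,
           SUM(r.sessiontime) AS total_time,
           g.place AS place
    FROM cdrs_india AS r
    JOIN geoip AS g ON r.cc_card = g.vcard
    GROUP BY g.place
    ORDER BY g.place
""").fetchall()

print(rows)
```

In BigQuery the first step would correspond to running a query and saving its results to a destination table; the second step then references that table by name.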

Finally, another option would be to pre-join the data, using (for example) a MapReduce transformation pipeline, or an ETL tool provided by one of the BigQuery ETL partners.
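
As a rough idea of what pre-joining means, here is a small sketch in plain Python that attaches `place` to each call record before the data would be loaded, so that no join is needed at query time. This is not a MapReduce pipeline or any partner ETL tool, just an in-memory hash join that shows the shape of the transformation. The function name `pre_join` and the sample records are hypothetical, and it assumes each `vcard` maps to exactly one `place`.

```python
def pre_join(cdrs, geoip):
    """Attach `place` to each call record via an in-memory hash join.

    Assumes each vcard appears once in `geoip`; if a vcard repeats,
    the last place seen for it wins.
    """
    place_by_card = {row["vcard"]: row["place"] for row in geoip}
    joined = []
    for rec in cdrs:
        place = place_by_card.get(rec["cc_card"])
        if place is not None:  # inner-join semantics: drop unmatched records
            joined.append({**rec, "place": place})
    return joined


# Made-up sample data using the column names from the query above.
cdrs = [
    {"cc_card": "A", "sessiontime": 10, "countryName": "India"},
    {"cc_card": "B", "sessiontime": 7, "countryName": "India"},
    {"cc_card": "Z", "sessiontime": 4, "countryName": "India"},  # no match in geoip
]
geoip = [
    {"vcard": "A", "place": "Mumbai"},
    {"vcard": "B", "place": "Delhi"},
]

denormalized = pre_join(cdrs, geoip)
for row in denormalized:
    print(row)
```

The denormalized rows could then be loaded as a single table, and the original aggregation would become a plain `GROUP BY place` with no join.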
