How can I join two tables using intervals in Google Big Query?


Problem description


You have identified a solution of finding an area within a bounding box /circle using cross join as below:

SELECT A.ID, C.Car 
FROM Cars C 
CROSS JOIN Areas A
WHERE C.Latitude BETWEEN A.LatitudeMin AND A.LatitudeMax AND
  C.Longitude BETWEEN A.LongitudeMin AND A.LongitudeMax

at: How to cross join in Big Query using intervals?

However, the GBQ ops team blocks cross joins on large data sets because of infrastructure constraints.
Hence my question: how can I find the set of lat/long points in a large table (table A) that fall within a set of bounding boxes held in a small table (table B)?

My query, shown below, has been blocked:

select a.a1, a.a2 , a.mdl, b.name, count(1) count 
from TableMaster a 
CROSS JOIN places_locations b 
where (a.lat 
    BETWEEN  b.bottom_right_lat AND b.top_left_lat) 
AND (a.long 
    BETWEEN b.top_left_long AND b.bottom_right_long) 
group by ....

TableMaster is 538 GB with 6,658,716,712 rows (cleaned, the absolute minimum). places_locations varies per query, from around 5 to 100 KB.

I have tried to adapt the fake join from this template: How to improve performance of GeoIP query in BigQuery?

However, the query runs for an hour and produces neither results nor errors.

Can you suggest a possible way to solve this puzzle?

Solution

The problem you're seeing is that the cross join generates too many intermediate values (6 billion x 1k = 6 trillion).

The way to work around this is to generate fewer outputs. If you have additional filters you can apply, you should try applying them before you do the join. If you could do the group by (or part of it) before the join, that would also help.
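
As a rough illustration of doing part of the group by before the join, the sketch below collapses duplicate coordinates first and only then joins the reduced set to the boxes. It runs against an in-memory SQLite database through Python's sqlite3 module purely so that it is runnable; it is not BigQuery syntax verified against the real tables. The sample rows, the reduced column list and the `pre` subquery name are assumptions, and the approach only pays off if many rows share the same coordinates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table and column names follow the question; the column list is trimmed
# and the rows below are made-up sample data.
conn.executescript("""
CREATE TABLE TableMaster (a1 TEXT, mdl TEXT, lat REAL, long REAL);
CREATE TABLE places_locations (
    name TEXT,
    top_left_lat REAL, top_left_long REAL,
    bottom_right_lat REAL, bottom_right_long REAL
);
""")
conn.executemany(
    "INSERT INTO TableMaster VALUES (?, ?, ?, ?)",
    [("x", "m1", 10.5, 20.5)] * 3      # three rows at the same point
    + [("y", "m1", 50.0, 50.0)] * 2    # outside every box
    + [("x", "m1", 11.0, 21.0)],
)
conn.execute(
    "INSERT INTO places_locations VALUES ('boxA', 12.0, 20.0, 10.0, 22.0)"
)

# Step 1 (pre): one row per distinct key and coordinate, with a count.
# Step 2: join only those rows to the boxes and sum the partial counts.
result = conn.execute("""
WITH pre AS (
    SELECT a1, mdl, lat, long, COUNT(*) AS n
    FROM TableMaster
    GROUP BY a1, mdl, lat, long
)
SELECT p.a1, p.mdl, b.name, SUM(p.n) AS cnt
FROM pre p
CROSS JOIN places_locations b
WHERE p.lat BETWEEN b.bottom_right_lat AND b.top_left_lat
  AND p.long BETWEEN b.top_left_long AND b.bottom_right_long
GROUP BY p.a1, p.mdl, b.name
""").fetchall()

print(result)
```

The join still compares every remaining row with every box, so the saving comes entirely from how much the first aggregation shrinks the large table.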

Moreover, you could do a more coarse-grained lookup first. That is, if you can do an initial cross join against a smaller table that holds coarse-grained regions, you can then join the larger table on region id rather than cross joining it.
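
One possible reading of the coarse-grained lookup is a fixed grid: each box is expanded into the grid cells it overlaps, the large table is joined to those cells with an ordinary equality join, and the exact BETWEEN test is applied only to the surviving candidates. The sketch below is an assumption-laden illustration, not the answerer's exact method. It uses SQLite through Python's sqlite3 module so it can be run as is; the 1-degree `CELL` size, the `box_cells` table, the `cells_for_box` helper and the `grid_cell` function are all hypothetical names. Here the box expansion is done in Python rather than by a cross join, and in BigQuery the cell would presumably be computed with something like `FLOOR(lat / cell_size)` instead of a registered function.

```python
import math
import sqlite3

CELL = 1.0  # grid cell size in degrees (assumed; tune to typical box size)

def cells_for_box(top_left_lat, top_left_long, bottom_right_lat, bottom_right_long):
    """Yield every (cell_lat, cell_long) grid cell that the box overlaps."""
    lat_lo, lat_hi = math.floor(bottom_right_lat / CELL), math.floor(top_left_lat / CELL)
    long_lo, long_hi = math.floor(top_left_long / CELL), math.floor(bottom_right_long / CELL)
    for i in range(lat_lo, lat_hi + 1):
        for j in range(long_lo, long_hi + 1):
            yield i, j

conn = sqlite3.connect(":memory:")
# Hypothetical helper so SQL can map a coordinate to its grid cell.
conn.create_function("grid_cell", 1, lambda v: math.floor(v / CELL))
conn.executescript("""
CREATE TABLE TableMaster (a1 TEXT, lat REAL, long REAL);
CREATE TABLE places_locations (
    name TEXT,
    top_left_lat REAL, top_left_long REAL,
    bottom_right_lat REAL, bottom_right_long REAL
);
CREATE TABLE box_cells (name TEXT, cell_lat INTEGER, cell_long INTEGER);
""")

# Made-up sample data.
boxes = [("boxA", 12.5, 20.2, 10.3, 22.7)]
points = [
    ("x", 10.5, 20.5),  # inside boxA
    ("x", 10.1, 20.5),  # same cell as boxA's corner, but outside the box
    ("y", 50.0, 50.0),  # far away, never matches a cell
    ("x", 12.4, 22.6),  # inside boxA
    ("z", 11.0, 21.0),  # inside boxA
]
conn.executemany("INSERT INTO places_locations VALUES (?, ?, ?, ?, ?)", boxes)
conn.executemany("INSERT INTO TableMaster VALUES (?, ?, ?)", points)

# Expand the small table into (box, cell) pairs; this stays small.
for name, *corners in boxes:
    conn.executemany(
        "INSERT INTO box_cells VALUES (?, ?, ?)",
        [(name, i, j) for i, j in cells_for_box(*corners)],
    )

# Equality join on the cell, then the exact bounding-box check.
rows = conn.execute("""
SELECT a.a1, b.name, COUNT(*) AS cnt
FROM TableMaster a
JOIN box_cells c
  ON grid_cell(a.lat) = c.cell_lat AND grid_cell(a.long) = c.cell_long
JOIN places_locations b ON b.name = c.name
WHERE a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat
  AND a.long BETWEEN b.top_left_long AND b.bottom_right_long
GROUP BY a.a1, b.name
ORDER BY a.a1, b.name
""").fetchall()

print(rows)
```

Because each point falls in exactly one cell and each box lists a cell at most once, a point is counted at most once per box. A smaller `CELL` means fewer false candidates per cell but more rows in `box_cells`.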
