How to left outer join two big tables effectively

Problem description

I have two tables, table_a and table_b. table_a contains 216,646,500 rows (7,155,998,163 bytes); table_b contains 1,462,775 rows (2,096,277,141 bytes).

table_a's schema is: c_1, c_2, c_3, c_4; table_b's schema is: c_2, c_5, c_6, ... (about 10 columns)

I want to do a left outer join of the two tables on the key col_2, but it has been running for 16 hours and hasn't finished yet... The PySpark code is as follows:

combine_table = table_a.join(table_b, table_a.col_2 == table_b.col_2, 'left_outer').collect()

Is there any effective way to join two big tables like this?

Solution

Beware of exploding joins.

Working with an open dataset, this query won't run in a reasonable time:

#standardSQL
SELECT COUNT(*)
FROM `fh-bigquery.reddit_posts.2017_06` a
JOIN `fh-bigquery.reddit_comments.2017_06` b
ON a.subreddit=b.subreddit

What if we get rid of the top 100 joining keys from each side?

#standardSQL
SELECT COUNT(*)
FROM (
  SELECT * FROM `fh-bigquery.reddit_posts.2017_06`
  WHERE subreddit NOT IN (SELECT value FROM UNNEST((
  SELECT APPROX_TOP_COUNT(subreddit, 100) s
  FROM `fh-bigquery.reddit_posts.2017_06`
)))) a
JOIN (
  SELECT * FROM `fh-bigquery.reddit_comments.2017_06` b
  WHERE subreddit NOT IN (SELECT value FROM UNNEST((
  SELECT APPROX_TOP_COUNT(subreddit, 100) s
  FROM `fh-bigquery.reddit_comments.2017_06`
)))) b
ON a.subreddit=b.subreddit

This modified query ran in 70 seconds, and the result was:

90508538331

90 billion. That's an exploding join. We had 9 million rows in one table, 80 million rows in the second one, and our join produced 90 billion rows - even after eliminating the top 100 keys from each side.
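You can estimate this blow-up before paying for the full join, because each key contributes (occurrences in one table) × (occurrences in the other) output rows. The following is a minimal PySpark sketch of that idea, not part of the original answer; it reuses the table_a, table_b and col_2 names from the question:

from pyspark.sql import functions as F

# Per-key counts on each side; the join emits a_cnt * b_cnt rows for every matching key.
a_counts = table_a.groupBy("col_2").count().withColumnRenamed("count", "a_cnt")
b_counts = table_b.groupBy("col_2").count().withColumnRenamed("count", "b_cnt")

# Summing the per-key products gives the number of matched rows the join would emit,
# which is far cheaper to compute than the join itself.
estimate = (a_counts.join(b_counts, "col_2")
            .select(F.sum(F.col("a_cnt") * F.col("b_cnt")).alias("matched_rows")))
estimate.show()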

In your data, look for any key that could produce too many results, and remove it before performing the join (sometimes it's a default value, like null).
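In PySpark terms, that check and cleanup could look like the sketch below. It is only an illustration assuming the DataFrames from the question; the limit of 100 keys mirrors the BigQuery example above, and the frequency threshold is an arbitrary assumption - whether dropping those keys is acceptable depends on your data:

from pyspark.sql import functions as F

# Most frequent join keys on each side; a null or default value showing up here
# is the usual culprit behind an exploding join.
top_a = table_a.groupBy("col_2").count().orderBy(F.desc("count")).limit(100)
top_b = table_b.groupBy("col_2").count().orderBy(F.desc("count")).limit(100)
top_a.show(20, truncate=False)
top_b.show(20, truncate=False)

# If the heavy hitters turn out to be junk values (null, empty string, a sentinel),
# exclude them before the join and, if needed, handle those rows separately.
bad_keys = [r["col_2"] for r in top_b.collect() if r["count"] > 1_000_000]  # threshold is an assumption
cleaned_b = table_b.filter(~F.col("col_2").isin(bad_keys)) if bad_keys else table_b

combined = table_a.join(cleaned_b, table_a.col_2 == cleaned_b.col_2, "left_outer")
# Keep the result as a DataFrame (e.g. write it out) rather than collect()ing
# hundreds of millions of rows to the driver.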
