How to left outer join two big tables effectively


Question

I have two tables, table_a and table_b. table_a contains 216,646,500 rows (7,155,998,163 bytes); table_b contains 1,462,775 rows (2,096,277,141 bytes).

table_a's schema is: c_1, c_2, c_3, c_4; table_b's schema is: c_2, c_5, c_6, ... (about 10 columns)

I want to do a left_outer join of the two tables on the shared key col_2, but it has been running for 16 hours and hasn't finished yet. The PySpark code is as follows:

combine_table = table_a.join(table_b, table_a.col_2 == table_b.col_2, 'left_outer').collect()

Is there any effective way to join two big tables like this?

Recommended answer

Beware of exploding joins.

Working with an open dataset, this query won't run in a reasonable time:

#standardSQL
SELECT COUNT(*)
FROM `fh-bigquery.reddit_posts.2017_06` a
JOIN `fh-bigquery.reddit_comments.2017_06` b
ON a.subreddit=b.subreddit

What if we get rid of the top 100 joining keys from each side?

#standardSQL
SELECT COUNT(*)
FROM (
  SELECT * FROM `fh-bigquery.reddit_posts.2017_06`
  WHERE subreddit NOT IN (SELECT value FROM UNNEST((
  SELECT APPROX_TOP_COUNT(subreddit, 100) s
  FROM `fh-bigquery.reddit_posts.2017_06`
)))) a
JOIN (
  SELECT * FROM `fh-bigquery.reddit_comments.2017_06`
  WHERE subreddit NOT IN (SELECT value FROM UNNEST((
  SELECT APPROX_TOP_COUNT(subreddit, 100) s
  FROM `fh-bigquery.reddit_comments.2017_06`
)))) b
ON a.subreddit=b.subreddit

This modified query ran in 70 seconds, and the result was:

90508538331

90 billion. That's an exploding join. We had 9 million rows in one table, 80 million rows in the second, and our join produced 90 billion rows, even after eliminating the top 100 keys from each side.
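The explosion can be predicted before running the join: each key contributes (rows in table_a) × (rows in table_b) output rows, so summing that product over the shared keys gives the join's output size. A minimal pure-Python sketch of the arithmetic (toy keys, not Spark; names are illustrative):

```python
from collections import Counter

# Toy join keys standing in for each table's col_2 column.
a_keys = ["x"] * 3 + ["y"] * 2 + ["z"]
b_keys = ["x"] * 4 + ["y"]

a_counts = Counter(a_keys)
b_counts = Counter(b_keys)

# Output rows per shared key = count on the left * count on the right.
estimate = sum(a_counts[k] * b_counts[k]
               for k in a_counts.keys() & b_counts.keys())
print(estimate)  # 3*4 + 2*1 = 14
```

Running the same per-key count on both sides of a real join (in Spark, a `groupBy` + `count` on each table) shows immediately which keys would dominate the output.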

In your data, look for any key that could be producing too many results, and remove it before doing the join (sometimes it's a default value, such as null).
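That advice can be applied the same way the BigQuery example does: count rows per key, take the heaviest keys on each side, and filter them out of both tables before joining (handling the hot keys separately if their rows are actually needed). Below is a minimal pure-Python sketch of the idea with toy data and illustrative names, not a Spark implementation:

```python
from collections import Counter

# Toy rows; col_2 is the join key. A skewed default value (here "x",
# which could just as well be null) dominates both sides.
table_a = [{"col_2": k} for k in ["x"] * 1000 + ["y", "z", None]]
table_b = [{"col_2": k, "c_5": 1} for k in ["x"] * 1000 + ["y"]]

# Find the hottest key on each side (top 1 here; top 100 in the answer above).
hot = {k for k, _ in Counter(r["col_2"] for r in table_a).most_common(1)}
hot |= {k for k, _ in Counter(r["col_2"] for r in table_b).most_common(1)}

# Drop the hot keys from both sides before joining.
a_cool = [r for r in table_a if r["col_2"] not in hot]
b_cool = [r for r in table_b if r["col_2"] not in hot]

# Left outer join the cooled tables via a hash index on the right side.
b_index = {}
for r in b_cool:
    b_index.setdefault(r["col_2"], []).append(r)
joined = [(a, b) for a in a_cool for b in b_index.get(a["col_2"], [None])]
print(len(joined))  # 3 rows instead of the 1000*1000 pairs "x" would produce
```

In PySpark the equivalent filter would be something like `table_a.filter(~table_a.col_2.isin(hot_keys))` on each side before the `left_outer` join, with `hot_keys` collected from a per-key count.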
