What is an optimized way of joining large tables in Spark SQL


Question

I need to join tables using Spark SQL or the DataFrame API, and would like to know what the optimized way of achieving this would be.

The scenario is:

  1. All the data is present as Hive tables in ORC format (the base DataFrame and the reference files).
  2. I need to join one base file (DataFrame) read from Hive with 11-13 other reference files to create one big in-memory structure of about 400 columns (around 1 TB in size); a sketch of that kind of job follows below.
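A minimal sketch of the kind of job described above (the table names, the join key "id", and the left-join type are placeholders, and a SparkSession with Hive support is assumed):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hive support is needed to read the ORC-backed Hive tables.
    val spark = SparkSession.builder()
      .appName("large-table-joins")
      .enableHiveSupport()
      .getOrCreate()

    val base = spark.table("base_table")                 // the base DataFrame
    val refTables = Seq("ref_01", "ref_02", "ref_03")    // ... up to 11-13 reference tables

    // Naive version: chain the joins one after another on the shared key.
    val joined: DataFrame = refTables.foldLeft(base) { (acc, name) =>
      acc.join(spark.table(name), Seq("id"), "left")
    }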

What would be the best approach to achieve this? Please share your experience if you have encountered a similar problem.

Answer

My default advice on how to optimize joins is:

  1. Use a broadcast join if you can (see this notebook). From your question it seems your tables are large and a broadcast join is not an option.
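For reference, this is roughly what a broadcast join looks like when one side does fit in executor memory (a sketch; spark is a SparkSession as in the snippet above, and the table names and the key "id" are placeholders):

    import org.apache.spark.sql.functions.broadcast

    // Ships the small reference table to every executor so the large
    // table never has to be shuffled for this join.
    val result = spark.table("base_table")
      .join(broadcast(spark.table("small_ref")), Seq("id"), "left")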

  2. Consider using a very large cluster (it's cheaper than you may think). $250 right now (6/2016) buys about 24 hours of 800 cores with 6 TB of RAM and many SSDs on the EC2 spot instance market. When thinking about the total cost of a big data solution, I find that humans tend to substantially undervalue their time.

  3. Use the same partitioner. See this question for information on co-grouped joins.
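A sketch of what this can look like with the DataFrame API: repartition both sides on the join key with the same number of partitions, and persist the repartitioned base table so its shuffle is paid once across all the reference joins (the partition count, table names, and key "id" are assumptions):

    import org.apache.spark.sql.functions.col

    val numPartitions = 2000                    // tune to data volume and cluster size

    val baseByKey = spark.table("base_table")
      .repartition(numPartitions, col("id"))
      .persist()                                // reuse the shuffled base across joins

    val refByKey = spark.table("ref_01")
      .repartition(numPartitions, col("id"))

    // Both sides are now hash-partitioned on "id" with the same partition
    // count, so the join itself does not need an extra shuffle.
    val coPartitioned = baseByKey.join(refByKey, Seq("id"))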

  4. If the data is huge and/or your clusters cannot grow such that even (3) above leads to OOM, use a two-pass approach. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table.
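A sketch of the two-pass idea, using a derived bucket column; the bucket count, the key "id", and the paths are all example values:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, hash, lit, pmod}

    val numBuckets = 16

    // Pass 1: persist both tables partitioned by the same derived bucket column.
    def withBucket(df: DataFrame): DataFrame =
      df.withColumn("bucket", pmod(hash(col("id")), lit(numBuckets)))

    withBucket(spark.table("base_table"))
      .write.partitionBy("bucket").format("orc").save("/tmp/base_bucketed")
    withBucket(spark.table("ref_01"))
      .write.partitionBy("bucket").format("orc").save("/tmp/ref_bucketed")

    // Pass 2: join the matching sub-partitions one at a time (each small enough
    // to fit in memory) and write every result into its own output partition.
    (0 until numBuckets).foreach { b =>
      spark.read.orc(s"/tmp/base_bucketed/bucket=$b")
        .join(spark.read.orc(s"/tmp/ref_bucketed/bucket=$b"), Seq("id"))
        .write.mode("overwrite").orc(s"/tmp/joined/bucket=$b")
    }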

Side note: I say "appending" above because in production I never use SaveMode.Append. It is not idempotent and that's a dangerous thing. Instead, I use SaveMode.Overwrite deep into the subtree of a partitioned table tree structure. Prior to 2.0.0 and 1.6.2 you'll have to delete _SUCCESS or metadata files, or dynamic partition discovery will choke.
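In code that is roughly as follows (continuing the sketch above, where bucketResult stands for the join result of one bucket and b for its index; the output path layout is an example):

    import org.apache.spark.sql.SaveMode

    // Overwriting a single partition directory is idempotent: re-running the
    // iteration for one bucket replaces that bucket's output and nothing else.
    bucketResult
      .write
      .mode(SaveMode.Overwrite)
      .orc(s"/warehouse/result_table/bucket=$b")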

Hope this helps.

