What is an optimized way of joining large tables in Spark SQL
Question
I need to join tables using Spark SQL or the DataFrame API, and would like to know the optimized way of achieving it.
The scenario is:
- All data is in Hive in ORC format (the base dataframe and the reference files).
- I need to join one base file (dataframe) read from Hive with 11-13 other reference files to create a large in-memory structure of about 400 columns (roughly 1 TB in size).
What is the best approach to achieve this? Please share your experience if anyone has encountered a similar problem.
Answer
My default advice on how to optimize joins is:
1. Use a broadcast join if you can (see this notebook). From your question it seems your tables are large and a broadcast join is not an option.
2. Consider using a very large cluster (it's cheaper than you may think). $250 right now (6/2016) buys about 24 hours of 800 cores with 6 TB RAM and many SSDs on the EC2 spot instance market. When thinking about the total cost of a big data solution, I find that humans tend to substantially undervalue their time.
3. Use the same partitioner. See this question for information on co-grouped joins.
4. If the data is huge and/or your clusters cannot grow such that even (3) above leads to OOM, use a two-pass approach. First, re-partition the data and persist using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table.
Side note: I say "appending" above because in production I never use SaveMode.Append. It is not idempotent and that's a dangerous thing. I use SaveMode.Overwrite deep into the subtree of a partitioned table tree structure. Prior to 2.0.0 and 1.6.2 you'll have to delete _SUCCESS or metadata files or dynamic partition discovery will choke.
Hope this helps.