What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?
Problem description
What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism?
I have tried to set both of them in SparkSQL, but the task number of the second stage is always 200.
From the answer here, spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations.
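To make that concrete, here is a minimal, self-contained sketch (assuming the Spark 1.x SQLContext API and local mode; the app name and toy data are made up): a groupBy forces a shuffle, and the aggregated result comes back with spark.sql.shuffle.partitions partitions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("shuffle-demo").setMaster("local[4]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

sqlContext.setConf("spark.sql.shuffle.partitions", "300")

// groupBy introduces a shuffle, so the result picks up the SQL shuffle setting.
val df = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).toDF("key", "value")
val agg = df.groupBy("key").sum("value")
println(agg.rdd.partitions.length)  // 300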
spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. Note that spark.default.parallelism seems to only work for raw RDDs and is ignored when working with dataframes.
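The raw-RDD side can be checked the same way: reduceByKey called without an explicit partition count falls back to spark.default.parallelism (again a hypothetical local sketch):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallelism-demo")
  .setMaster("local[4]")
  .set("spark.default.parallelism", "300")
val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// No numPartitions argument, so the default partitioner uses spark.default.parallelism.
val counts = pairs.reduceByKey(_ + _)
println(counts.partitions.length)  // 300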
If the task you are performing is not a join or aggregation and you are working with dataframes, then setting these will not have any effect. You could, however, set the number of partitions yourself by calling df.repartition(numOfPartitions) (don't forget to assign it to a new val) in your code.
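A short sketch of that approach, reusing the df from the earlier example (the name repartitioned is just illustrative):

// DataFrames are immutable, so keep the repartitioned result in a new val.
val repartitioned = df.repartition(300)
println(repartitioned.rdd.partitions.length)  // 300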
To change the settings in your code you can simply do:
sqlContext.setConf("spark.sql.shuffle.partitions", "300")
sqlContext.setConf("spark.default.parallelism", "300")
Alternatively, you can make the change when submitting the job to a cluster with spark-submit:
./bin/spark-submit --conf spark.sql.shuffle.partitions=300 --conf spark.default.parallelism=300
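Either way, you can read the values back to confirm they reached the application (sqlContext.getConf and sc.getConf.get are the standard getters; the expected output assumes the flags above):

println(sqlContext.getConf("spark.sql.shuffle.partitions"))  // "300"
println(sc.getConf.get("spark.default.parallelism"))         // "300"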