What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?


Problem description

What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

I have tried to set both of them in SparkSQL, but the task number of the second stage is always 200.

Solution

From the answer here, spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations.

spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. Note that spark.default.parallelism seems to only apply to raw RDDs and is ignored when working with DataFrames.
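To make the distinction concrete, here is a minimal sketch on Spark 2.x+ (the 300/20 values and the toy data are purely illustrative, and adaptive query execution is turned off so the DataFrame shuffle keeps the configured partition count):

import org.apache.spark.sql.SparkSession

// Illustrative values: 300 for SQL shuffles, 20 for raw RDD shuffles
val spark = SparkSession.builder()
  .appName("shuffle-partitions-vs-default-parallelism")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "300")
  .config("spark.default.parallelism", "20")
  .config("spark.sql.adaptive.enabled", "false") // keep the configured shuffle partition count visible
  .getOrCreate()
import spark.implicits._

// DataFrame aggregation: the shuffle uses spark.sql.shuffle.partitions
val df  = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
val agg = df.groupBy("key").count()
println(agg.rdd.getNumPartitions)    // 300

// Raw RDD transformation: the shuffle uses spark.default.parallelism
val rdd     = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val reduced = rdd.reduceByKey(_ + _)
println(reduced.getNumPartitions)    // 20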

If the task you are performing is not a join or aggregation and you are working with dataframes then setting these will not have any effect. You could, however, set the number of partitions yourself by calling df.repartition(numOfPartitions) (don't forget to assign it to a new val) in your code.
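Continuing the sketch above (the target of 300 partitions is arbitrary), the work-around looks like this; repartition returns a new DataFrame, so the result has to be captured in a new val:

// repartition does not modify df; it returns a new DataFrame with the requested partition count
val repartitioned = df.repartition(300)
println(repartitioned.rdd.getNumPartitions)   // 300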


To change the settings in your code you can simply do:

sqlContext.setConf("spark.sql.shuffle.partitions", "300")
sqlContext.setConf("spark.default.parallelism", "300")

Alternatively, you can make the change when submitting the job to a cluster with spark-submit:

./bin/spark-submit --conf spark.sql.shuffle.partitions=300 --conf spark.default.parallelism=300

