What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?
Problem description
What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism?
I have tried to set both of them in SparkSQL, but the task number of the second stage is always 200.
From the answer here, spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations.
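To make that concrete, here is a minimal, self-contained sketch (assuming the Spark 1.x SQLContext API and local mode; the app name and toy data are made up): a groupBy forces a shuffle, and the aggregated result comes back with spark.sql.shuffle.partitions partitions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("shuffle-demo").setMaster("local[4]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

sqlContext.setConf("spark.sql.shuffle.partitions", "300")

// groupBy introduces a shuffle, so the result picks up the SQL shuffle setting.
val df = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).toDF("key", "value")
val agg = df.groupBy("key").sum("value")
println(agg.rdd.partitions.length)  // 300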
spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. Note that spark.default.parallelism seems to only work for raw RDDs and is ignored when working with dataframes.
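The raw-RDD side can be checked the same way: reduceByKey called without an explicit partition count falls back to spark.default.parallelism (again a hypothetical local sketch):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallelism-demo")
  .setMaster("local[4]")
  .set("spark.default.parallelism", "300")
val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// No numPartitions argument, so the default partitioner uses spark.default.parallelism.
val counts = pairs.reduceByKey(_ + _)
println(counts.partitions.length)  // 300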
If the task you are performing is not a join or aggregation and you are working with dataframes, then setting these will not have any effect. You could, however, set the number of partitions yourself by calling df.repartition(numOfPartitions) (don't forget to assign it to a new val) in your code.
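A short sketch of that approach, reusing the df from the earlier example (the name repartitioned is just illustrative):

// DataFrames are immutable, so keep the repartitioned result in a new val.
val repartitioned = df.repartition(300)
println(repartitioned.rdd.partitions.length)  // 300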
To change the settings in your code you can simply do:
sqlContext.setConf("spark.sql.shuffle.partitions", "300")
sqlContext.setConf("spark.default.parallelism", "300")
Alternatively, you can make the change when submitting the job to a cluster with spark-submit:
./bin/spark-submit --conf spark.sql.shuffle.partitions=300 --conf spark.default.parallelism=300
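Either way, you can read the values back to confirm they reached the application (sqlContext.getConf and sc.getConf.get are the standard getters; the expected output assumes the flags above):

println(sqlContext.getConf("spark.sql.shuffle.partitions"))  // "300"
println(sc.getConf.get("spark.default.parallelism"))         // "300"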