Number of reduce tasks in Spark


Problem description

What is the formula that Spark uses to calculate the number of reduce tasks?

I am running a couple of spark-sql queries and the number of reduce tasks is always 200. The number of map tasks for these queries is 154. I am on Spark 1.4.1.

Is this related to spark.shuffle.sort.bypassMergeThreshold, which defaults to 200?
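For reference, here is a minimal sketch of the kind of query I mean (the table name and data are made up; any GROUP BY or JOIN that shuffles behaves the same way). The partition count of the aggregated result is what shows up as 200 reduce tasks:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("reduce-task-count"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical data registered as a temp table.
    val df = sc.parallelize(1 to 1000).map(i => (i % 10, i)).toDF("key", "value")
    df.registerTempTable("events")

    // The aggregation forces a shuffle; its partition count is the reduce task count.
    val agg = sqlContext.sql("SELECT key, COUNT(*) AS cnt FROM events GROUP BY key")
    println(agg.rdd.partitions.size)  // 200 with default settings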

Recommended answer

It's spark.sql.shuffle.partitions that you're after. According to the Spark SQL performance tuning guide:

+-------------------------------+---------+------------------------------------------------+
| Property Name                 | Default | Meaning                                        |
+-------------------------------+---------+------------------------------------------------+
| spark.sql.shuffle.partitions  | 200     | Configures the number of partitions to use     |
|                               |         | when shuffling data for joins or aggregations. |
+-------------------------------+---------+------------------------------------------------+
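As a rough sketch, you can change it either on the SQLContext or from SQL itself (the value 50 is just for illustration):

    // Affects subsequent shuffles for joins/aggregations in Spark SQL.
    sqlContext.setConf("spark.sql.shuffle.partitions", "50")

    // Equivalent, from within SQL:
    sqlContext.sql("SET spark.sql.shuffle.partitions=50")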

Another related option is spark.default.parallelism, which determines the "default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user"; however, this seems to be ignored by Spark SQL and is only relevant when working with plain RDDs, as in the sketch below.
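For comparison, a small sketch of where spark.default.parallelism does take effect, i.e. on plain RDD shuffles (the value 8 is just illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("rdd-parallelism")
      .set("spark.default.parallelism", "8")
    val sc = new SparkContext(conf)

    // reduceByKey with no explicit partition count picks up spark.default.parallelism.
    val counts = sc.parallelize(1 to 1000).map(x => (x % 10, 1)).reduceByKey(_ + _)
    println(counts.partitions.size)  // 8, from spark.default.parallelism

    // A Spark SQL aggregation on the same context would still use
    // spark.sql.shuffle.partitions (200 by default), not this setting.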

