Number of reduce tasks in Spark


Question

What is the formula that Spark uses to calculate the number of reduce tasks?

I am running a couple of spark-sql queries and the number of reduce tasks is always 200. The number of map tasks for these queries is 154. I am on Spark 1.4.1.

Is this related to spark.shuffle.sort.bypassMergeThreshold, which defaults to 200?

Answer

The setting you want is spark.sql.shuffle.partitions. According to the Spark SQL programming guide:

spark.sql.shuffle.partitions (default: 200): Configures the number of partitions to use when shuffling data for joins or aggregations.
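In practice, on Spark 1.4 you can change this value on the SQLContext before running a query. The sketch below is illustrative only: the application name, the sample data, and the temp table "people" are assumed, not taken from the original post.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("ShufflePartitionsDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Lower the number of shuffle partitions (reduce tasks) from the default 200 to 50.
// The same effect can be achieved with: sqlContext.sql("SET spark.sql.shuffle.partitions=50")
sqlContext.setConf("spark.sql.shuffle.partitions", "50")

// Register a small hypothetical table so the aggregation below has something to shuffle.
val people = sc.parallelize(Seq(("alice", 30), ("bob", 25), ("alice", 31))).toDF("name", "age")
people.registerTempTable("people")

// This GROUP BY triggers a shuffle; its reduce stage now runs 50 tasks instead of 200.
val counts = sqlContext.sql("SELECT name, COUNT(*) AS cnt FROM people GROUP BY name")
counts.show()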

Another related option is spark.default.parallelism, which determines the "default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user"; however, this seems to be ignored by Spark SQL and is only relevant when working on plain RDDs.
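For comparison, here is a minimal sketch (names and values assumed, not from the original answer) of how spark.default.parallelism controls the reduce-side partition count of a plain-RDD reduceByKey:

import org.apache.spark.{SparkConf, SparkContext}

// spark.default.parallelism applies to plain RDD shuffles; Spark SQL ignores it
// and uses spark.sql.shuffle.partitions instead.
val conf = new SparkConf()
  .setAppName("DefaultParallelismDemo")
  .setMaster("local[*]")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// No explicit numPartitions argument, so the shuffle falls back to
// spark.default.parallelism and produces 8 reduce tasks here.
val sums = pairs.reduceByKey(_ + _)
println(sums.partitions.length) // expected output: 8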
