Why is the number of partitions after groupBy 200? Why is this 200 not some other number?


Question

It's Spark 2.2.0-SNAPSHOT.

Why is the number of partitions 200 after the groupBy transformation in the following example?

scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
res0: Int = 200

What's so special about 200? Why not some other number like 1024?

I've seen Why does groupByKey operation have always 200 tasks?, which asks specifically about groupByKey, but this question is about the "mystery" behind picking 200 as the default, not about why there are 200 partitions by default.

Answer

This is set by spark.sql.shuffle.partitions.
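
You can check the current value from the session's runtime configuration; a minimal sketch run in the Spark shell (the res numbering and the reported value assume an out-of-the-box session):

scala> spark.conf.get("spark.sql.shuffle.partitions")
res1: String = 200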

In general, whenever you do a Spark SQL aggregation or a join that shuffles data, this is the number of resulting partitions.
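
As an illustration, lowering spark.sql.shuffle.partitions before running the same aggregation changes the resulting partition count accordingly; a minimal sketch assuming a fresh Spark shell (the res numbers are illustrative):

scala> spark.conf.set("spark.sql.shuffle.partitions", "8")

scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
res2: Int = 8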

It is constant for your entire action (i.e. it is not possible to change it for one transformation and then again for another).

See http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options for more info.
