Spark DataFrame repartition: number of partitions not preserved


Question


According to the docs of Spark 1.6.3, repartition(partitionExprs: Column*) should preserve the number of partitions in the resulting dataframe:


Returns a new DataFrame partitioned by the given partitioning expressions preserving the existing number of partitions


But the following example seems to show something else (note that spark-master is local[4] in my case):

val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[4]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val myDF = sc.parallelize(Seq(1,1,2,2,3,3)).toDF("x")
myDF.rdd.getNumPartitions // 4 
myDF.repartition($"x").rdd.getNumPartitions //  200 !


How can that be explained? I'm using Spark 1.6.3 as a standalone application (i.e. running locally in IntelliJ IDEA).


This question does not address the issue from Dropping empty DataFrame partitions in Apache Spark (i.e. how to repartition along a column without producing empty partitions), but asks why the docs say something different from what I observe in my example.

Recommended answer


This is related to the Tungsten project, which is enabled in Spark. It uses hardware optimizations and applies hash partitioning, which triggers a shuffle operation. By default, spark.sql.shuffle.partitions is set to 200. You can verify this by calling explain on your DataFrame before repartitioning and after:

myDF.explain

val repartitionedDF = myDF.repartition($"x")

repartitionedDF.explain
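
For completeness, here is a minimal sketch that is not part of the original answer (it reuses sqlContext and myDF from the question): since the expression-based repartition takes its target partition count from spark.sql.shuffle.partitions, changing that setting changes the resulting number of partitions.

// Lowering the shuffle-partition setting before an expression-based
// repartition makes the result match that setting instead of 200:
sqlContext.setConf("spark.sql.shuffle.partitions", "4")
myDF.repartition($"x").rdd.getNumPartitions // 4 instead of 200

// Spark 1.6 also provides repartition(numPartitions, partitionExprs),
// which requests an explicit partition count directly:
myDF.repartition(4, $"x").rdd.getNumPartitions // 4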

