How to get the number of partitions in a dataset?


Problem description

I know there are many questions on the same topic, but none really answers my question.

I have the following data:

    import spark.implicits._  // needed for toDF and the $"..." column syntax

    val data_codes = Seq("con_dist_1", "con_dist_2", "con_dist_3", "con_dist_4", "con_dist_5")
    val codes = data_codes.toDF("item_code")
    val partitioned_codes = codes.repartition($"item_code")
    println("getNumPartitions : " + partitioned_codes.rdd.getNumPartitions)

Output:

getNumPartitions : 200

It's supposed to give 5, right? Why is it giving 200? Where am I going wrong, and how do I fix this?

Answer

Because 200 is the default value of spark.sql.shuffle.partitions, which is what df.repartition uses. From the docs:

Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as the number of partitions. The resulting Dataset is hash partitioned.
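
You can see the effect by changing that setting or by passing an explicit count to repartition. A minimal sketch, assuming a SparkSession named spark and the codes DataFrame from the question:

    // Option 1: lower the session default that repartition(column) falls back on.
    spark.conf.set("spark.sql.shuffle.partitions", "5")
    println(codes.repartition($"item_code").rdd.getNumPartitions)  // now prints 5

    // Option 2: pass the partition count explicitly, overriding the default.
    println(codes.repartition(5, $"item_code").rdd.getNumPartitions)  // prints 5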

The number of partitions is NOT related to the number of (distinct) values in your dataframe. Repartitioning only guarantees that all records with the same key end up in the same partition, nothing else. So in your case it could be that all records are in 1 partition and the other 199 partitions are empty.
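
You can verify this yourself by counting the records in each partition. A minimal sketch, reusing partitioned_codes from the question:

    // Count records per partition and print only the non-empty ones;
    // with 5 keys hashed into 200 partitions, most will be empty.
    partitioned_codes.rdd
      .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
      .filter(_._2 > 0)
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx -> $n records") }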

Even if you do codes.repartition(5, $"item_code"), there is no guarantee that you get 5 equally sized partitions. AFAIK you cannot do this in the DataFrame API; you may be able to with an RDD and a custom partitioner.
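
For completeness, here is a minimal sketch of that RDD route. The CodePartitioner class and the mapping of each item_code to its own partition are illustrative assumptions, not part of the original answer:

    import org.apache.spark.Partitioner

    // Hypothetical partitioner: route each known code to its own partition.
    class CodePartitioner(allCodes: Seq[String]) extends Partitioner {
      private val index = allCodes.zipWithIndex.toMap
      override def numPartitions: Int = allCodes.size
      override def getPartition(key: Any): Int =
        index.getOrElse(key.asInstanceOf[String], 0)
    }

    val byCode = codes.rdd
      .map(row => (row.getString(0), row))          // key each Row by item_code
      .partitionBy(new CodePartitioner(data_codes)) // exactly 5 partitions, one per code

    println("getNumPartitions : " + byCode.getNumPartitions)  // prints 5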

