Spark Creates Fewer Partitions Than the minPartitions Argument on WholeTextFiles


Question

I have a folder which contains 14 files. I run spark-submit with 10 executors on a cluster whose resource manager is YARN.

I create my first RDD like this:

JavaPairRDD<String,String> files = sc.wholeTextFiles(folderPath.toString(), 10);

However, files.getNumPartitions() gives me 7 or 8, seemingly at random. I do not use coalesce/repartition anywhere after that, so my DAG finishes with 7-8 partitions.
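One way to see how the files were packed is to tag each record with its partition index. A minimal Scala sketch, assuming the same folderPath as above (the Java API used in the question behaves the same way):

    // List which file path ended up in which partition.
    val files = sc.wholeTextFiles(folderPath.toString, 10)
    println(s"numPartitions = ${files.getNumPartitions}")      // 7 or 8 in this case
    files
      .mapPartitionsWithIndex((idx, it) => it.map { case (path, _) => (idx, path) })
      .collect()
      .foreach { case (idx, path) => println(s"partition $idx -> $path") }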

As far as I know, the argument we give is the minimum number of partitions, so why does Spark divide my RDD into only 7-8 partitions?

I also ran the same program with minPartitions = 20, and it gave me 11 partitions.

I have seen a related question here, but it was about getting more partitions, which did not help me at all.

Note: In the same program, I read another folder which has 10 files, and Spark creates 10 partitions successfully. I run the problematic transformation above after that successful job finishes.

File sizes:
1) 25.07 KB
2) 46.61 KB
3) 126.34 KB
4) 158.15 KB
5) 169.21 KB
6) 16.03 KB
7) 67.41 KB
8) 60.84 KB
9) 70.83 KB
10) 87.94 KB
11) 99.29 KB
12) 120.58 KB
13) 170.43 KB
14) 183.87 KB

The files are on HDFS; the block size is 128 MB and the replication factor is 3.

Answer


It would have been clearer if we had the size of each file, but the code is not wrong. I am adding this answer based on the Spark code base.




  • First, maxSplitSize is calculated based on the total directory size and the minPartitions passed to wholeTextFiles:

        def setMinPartitions(context: JobContext, minPartitions: Int) {
          val files = listStatus(context).asScala
          val totalLen = files.map(file => if (file.isDirectory) 0L else file.getLen).sum
          val maxSplitSize = Math.ceil(totalLen * 1.0 /
            (if (minPartitions == 0) 1 else minPartitions)).toLong
          super.setMaxSplitSize(maxSplitSize)
        }
        // file: WholeTextFileInputFormat.scala
    

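    Applying that formula to the file sizes listed in the question gives a feel for the numbers involved. A back-of-the-envelope sketch (Scala; sizes in KB as listed, whereas the real code uses the byte lengths reported by HDFS):

        // Rough arithmetic only; not Spark code.
        val sizesKB = Seq(25.07, 46.61, 126.34, 158.15, 169.21, 16.03, 67.41,
                          60.84, 70.83, 87.94, 99.29, 120.58, 170.43, 183.87)
        val totalKB = sizesKB.sum                       // ~1402.6 KB
        val maxSplitFor10 = math.ceil(totalKB / 10)     // ~141 KB per split for minPartitions = 10
        val maxSplitFor20 = math.ceil(totalKB / 20)     // ~71 KB per split for minPartitions = 20

    Because wholeTextFiles marks every file as non-splittable, CombineFileInputFormat can only pack whole files into a split, and it closes a split as soon as the accumulated size reaches maxSplitSize. Four of the files (158.15, 169.21, 170.43 and 183.87 KB) already exceed the ~141 KB limit on their own, so most splits end up holding one to three whole files, and the grouping (which follows HDFS block locality) produces 7 or 8 splits rather than 10.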

  • Based on that maxSplitSize, splits (partitions in Spark) are then extracted from the source:

        inputFormat.setMinPartitions(jobContext, minPartitions)
        val rawSplits = inputFormat.getSplits(jobContext).toArray // Here the number of splits is decided
        val result = new Array[Partition](rawSplits.size)
        for (i <- 0 until rawSplits.size) {
          result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
        }
        // file: WholeTextFileRDD.scala
    


  • More detail on how files are read and splits are prepared is available in CombineFileInputFormat#getSplits.
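    The packing itself is essentially a greedy bin-filling over whole files; the real getSplits additionally groups blocks by node and rack, which is where the run-to-run variation between 7 and 8 comes from. A deliberately simplified, hypothetical illustration, reusing sizesKB and the maxSplit values from the sketch above:

        // Simplified illustration only; NOT the real CombineFileInputFormat logic,
        // which also honours node/rack locality when grouping blocks.
        def packWholeFiles(sizesKB: Seq[Double], maxSplitKB: Double): Int = {
          var splits = 0
          var current = 0.0
          for (size <- sizesKB) {
            current += size                      // whole files only: a file is never cut
            if (current >= maxSplitKB) {         // a split is closed once it reaches the limit
              splits += 1
              current = 0.0
            }
          }
          if (current > 0) splits += 1           // leftover files form the last split
          splits
        }

        println(packWholeFiles(sizesKB, maxSplitFor10))  // 8 for the order listed in the question
        println(packWholeFiles(sizesKB, maxSplitFor20))  // 11, matching the observed result for minPartitions = 20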


    Note: I refer to Spark partitions as MapReduce splits here, because Spark borrowed its input and output formatters from MapReduce.
