Load parquet file and keep same number of HDFS partitions
Question
I have a parquet file /df saved in HDFS with 120 partitions. The size of each partition on HDFS is around 43.5 MB.
Total size:
hdfs dfs -du -s -h /df
5.1 G 15.3 G /df
hdfs dfs -du -h /df
43.6 M 130.7 M /df/pid=0
43.5 M 130.5 M /df/pid=1
...
43.6 M 130.9 M /df/pid=119
I want to load that file into Spark and keep the same number of partitions. However, Spark will automatically load the file into 60 partitions.
df = spark.read.parquet('df')
df.rdd.getNumPartitions()
60
HDFS settings:

'parquet.block.size' is not set.
sc._jsc.hadoopConfiguration().get('parquet.block.size')
returns nothing.
'dfs.blocksize' is set to 128.
float(sc._jsc.hadoopConfiguration().get("dfs.blocksize"))/2**20
returns
128
Changing either of those values to something lower does not result in the parquet file loading with the same number of partitions as there are in HDFS.
For example:
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 64*2**20)
sc._jsc.hadoopConfiguration().setInt("dfs.blocksize", 64*2**20)
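For what it's worth, the knob that actually drives how Spark SQL splits files at read time is the Spark SQL setting `spark.sql.files.maxPartitionBytes` (together with `spark.sql.files.openCostInBytes`), not the Hadoop-level `dfs.blocksize` or `parquet.block.size` values. A sketch of lowering it, assuming an existing `SparkSession` named `spark` (this is a runtime config fragment, not a complete program):

```python
# Assumption: a SparkSession named `spark` already exists.
# spark.sql.files.maxPartitionBytes caps how many bytes of input land in a
# single read partition; lowering it below one file's size means each file
# gets its own partition, so the read should yield more partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", 45 * 1024 * 1024)

df = spark.read.parquet('df')
df.rdd.getNumPartitions()
```

Note that this config must be set before the read is planned; changing it afterwards does not repartition an already-loaded DataFrame.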
I realize 43.5 MB is well below 128 MB. However, for this application, I am going to immediately apply many transformations that will bring each of the 120 partitions much closer to 128 MB.
I am trying to save myself from having to repartition in the application immediately after loading.
Is there a way to force Spark to load the parquet file with the same number of partitions that are stored in HDFS?
Answer
First, I'd start by checking how Spark splits the data into partitions. By default this depends on the nature and size of your data and cluster. This article should explain why your data frame was loaded into 60 partitions:
In general, it's Catalyst that takes care of all the optimization (including the number of partitions), so unless there is a really good reason for custom settings, I'd let it do its job. If any of the transformations you use are wide, Spark will shuffle the data anyway.
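As background on where 60 comes from: when reading files, Spark SQL packs file splits into read partitions of at most `spark.sql.files.maxPartitionBytes` (128 MB by default), charging an extra `spark.sql.files.openCostInBytes` (4 MB by default) per file. A minimal pure-Python sketch of that bin-packing (an approximation of Spark's internal logic, not a real Spark API):

```python
def num_read_partitions(file_sizes, max_split_bytes=128 * 1024 * 1024,
                        open_cost_bytes=4 * 1024 * 1024):
    """Approximate Spark SQL's file packing: files (none larger than
    max_split_bytes here) are binned until adding the next file would
    exceed max_split_bytes; each file also 'costs' open_cost_bytes."""
    partitions = 0
    current = 0
    for size in sorted(file_sizes, reverse=True):
        if current + size > max_split_bytes and current > 0:
            partitions += 1      # close the current bin
            current = 0
        current += size + open_cost_bytes
    return partitions + (1 if current > 0 else 0)

mb = 1024 * 1024
sizes = [int(43.5 * mb)] * 120   # 120 parquet parts of ~43.5 MB each

print(num_read_partitions(sizes))                           # 60: two files per 128 MB bin
print(num_read_partitions(sizes, max_split_bytes=45 * mb))  # 120: one file per bin
```

Two 43.5 MB files plus their open costs fit under 128 MB but a third does not, which is exactly why 120 files collapse into 60 read partitions.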