Load parquet file and keep same number of HDFS partitions
Question
I have a parquet file /df saved in HDFS with 120 partitions. The size of each partition on HDFS is around 43.5 MB.
Total size:
hdfs dfs -du -s -h /df
5.1 G 15.3 G /df
hdfs dfs -du -h /df
43.6 M 130.7 M /df/pid=0
43.5 M 130.5 M /df/pid=1
...
43.6 M 130.9 M /df/pid=119
I want to load that file into Spark and keep the same number of partitions. However, Spark will automatically load the file into 60 partitions.
df = spark.read.parquet('df')
df.rdd.getNumPartitions()
60
HDFS settings:

'parquet.block.size' is not set.
sc._jsc.hadoopConfiguration().get('parquet.block.size')
returns nothing.
'dfs.blocksize' is set to 128.
float(sc._jsc.hadoopConfiguration().get("dfs.blocksize"))/2**20
returns
128
Changing either of those values to something lower does not result in the parquet file loading with the same number of partitions as there are in HDFS.
For example:
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 64*2**20)
sc._jsc.hadoopConfiguration().setInt("dfs.blocksize", 64*2**20)
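For what it's worth, the knob that actually drives how Spark SQL splits files at read time is the Spark SQL setting `spark.sql.files.maxPartitionBytes` (together with `spark.sql.files.openCostInBytes`), not the Hadoop-level `dfs.blocksize` or `parquet.block.size` values. A sketch of lowering it, assuming an existing `SparkSession` named `spark` (this is a runtime config fragment, not a complete program):

```python
# Assumption: a SparkSession named `spark` already exists.
# spark.sql.files.maxPartitionBytes caps how many bytes of input land in a
# single read partition; lowering it below one file's size means each file
# gets its own partition, so the read should yield more partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", 45 * 1024 * 1024)

df = spark.read.parquet('df')
df.rdd.getNumPartitions()
```

Note that this config must be set before the read is planned; changing it afterwards does not repartition an already-loaded DataFrame.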
I realize 43.5 MB is well below 128 MB. However, for this application, I am going to immediately apply many transformations that will bring each of the 120 partitions much closer to 128 MB.
I am trying to save myself from having to repartition in the application immediately after loading.
Is there a way to force Spark to load the parquet file with the same number of partitions that are stored in HDFS?
Answer
First, I'd start by checking how Spark splits the data into partitions. By default this depends on the nature and size of your data and cluster. This article should explain why your data frame was loaded into 60 partitions:
In general, it's Catalyst that takes care of all the optimization (including the number of partitions), so unless there is a really good reason for custom settings, I'd let it do its job. If any of the transformations you use are wide, Spark will shuffle the data anyway.
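As background on where 60 comes from: when reading files, Spark SQL packs file splits into read partitions of at most `spark.sql.files.maxPartitionBytes` (128 MB by default), charging an extra `spark.sql.files.openCostInBytes` (4 MB by default) per file. A minimal pure-Python sketch of that bin-packing (an approximation of Spark's internal logic, not a real Spark API):

```python
def num_read_partitions(file_sizes, max_split_bytes=128 * 1024 * 1024,
                        open_cost_bytes=4 * 1024 * 1024):
    """Approximate Spark SQL's file packing: files (none larger than
    max_split_bytes here) are binned until adding the next file would
    exceed max_split_bytes; each file also 'costs' open_cost_bytes."""
    partitions = 0
    current = 0
    for size in sorted(file_sizes, reverse=True):
        if current + size > max_split_bytes and current > 0:
            partitions += 1      # close the current bin
            current = 0
        current += size + open_cost_bytes
    return partitions + (1 if current > 0 else 0)

mb = 1024 * 1024
sizes = [int(43.5 * mb)] * 120   # 120 parquet parts of ~43.5 MB each

print(num_read_partitions(sizes))                           # 60: two files per 128 MB bin
print(num_read_partitions(sizes, max_split_bytes=45 * mb))  # 120: one file per bin
```

Two 43.5 MB files plus their open costs fit under 128 MB but a third does not, which is exactly why 120 files collapse into 60 read partitions.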