Does Spark maintain parquet partitioning on read?
Question
I am having a lot of trouble finding the answer to this question. Let's say I write a dataframe to parquet, and I use repartition combined with partitionBy to get a nicely partitioned parquet file. See below:
df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")
Later on, I would like to read the parquet file, so I do something like this:
val df = spark.read.parquet("/path/to/parquet/file")
Is the dataframe partitioned by "DATE"? In other words, if a parquet file is partitioned, does Spark maintain that partitioning when reading it into a Spark dataframe, or is it randomly partitioned?
An explanation of why or why not would also be helpful.
Recommended answer
The number of partitions acquired when reading data stored as parquet follows many of the same rules as reading partitioned text:
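- If SparkContext.minPartitions >= the partition count in the data, SparkContext.minPartitions will be returned.
- If the partition count in the data >= SparkContext.parallelism, SparkContext.parallelism will be returned, though in some cases where the partitions are very small, the third rule may apply instead.
- Finally, if the partition count in the data falls somewhere between SparkContext.minPartitions and SparkContext.parallelism, you will generally see the data's partitions reflected in the dataset partitioning.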
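As a rough illustration, the three rules above can be written as a small pure function. This is only a sketch of the rule of thumb stated in this answer, not Spark's actual planning logic. The function name and its parameters are hypothetical labels for the quantities the rules mention; in a real session the closest values are presumably `spark.sparkContext.defaultMinPartitions` and `spark.sparkContext.defaultParallelism`. Actual counts for DataFrame reads also depend on file sizes and on settings such as `spark.sql.files.maxPartitionBytes`, so treat the result as an estimate.

```scala
// Rule-of-thumb model of the three cases above; NOT Spark's real planner.
// All names here are hypothetical and exist only for illustration.
def expectedReadPartitions(dataPartitions: Int, minPartitions: Int, parallelism: Int): Int =
  if (minPartitions >= dataPartitions) minPartitions   // rule 1: few data partitions
  else if (dataPartitions >= parallelism) parallelism  // rule 2: many data partitions
  else dataPartitions                                  // rule 3: in between, mirror the data

// Example inputs: minPartitions = 2, parallelism = 8
println(expectedReadPartitions(1, 2, 8))   // 2
println(expectedReadPartitions(5, 2, 8))   // 5
println(expectedReadPartitions(100, 2, 8)) // 8
```

To compare the estimate with what Spark really did, `df.rdd.getNumPartitions` reports the partition count of a DataFrame after it has been read.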
Note that it's rare for a partitioned parquet file to have full data locality for a partition. Even when the partition count in the data matches the read partition count, the dataset will most likely still need to be repartitioned in memory if you're trying to achieve partition data locality for performance.
Given your use case above, I'd recommend repartitioning on the "DATE" column immediately after reading if you're planning to leverage partition-local operations on that basis. The above caveats regarding the minPartitions and parallelism settings apply here as well.
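import org.apache.spark.sql.functions.col
val df = spark.read.parquet("/path/to/parquet/file")
// repartition returns a new DataFrame, so keep the result
val byDate = df.repartition(col("DATE"))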
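One way to check whether rows with the same "DATE" really end up together is to tag each row with its in-memory partition id and count how many distinct partitions each date is spread across. The helper below is a minimal sketch under that assumption. The name `maxPartitionsPerDate` and the temporary column `pid` are made up for illustration; `spark_partition_id`, `countDistinct` and `max` are standard functions in `org.apache.spark.sql.functions`. It assumes a non-empty DataFrame with a "DATE" column.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, countDistinct, max, spark_partition_id}

// Hypothetical helper: the largest number of in-memory partitions that any
// single DATE value is spread across. A result of 1 means every DATE is
// fully co-located in one partition. Assumes df is non-empty.
def maxPartitionsPerDate(df: DataFrame): Long =
  df.withColumn("pid", spark_partition_id())      // tag rows with their partition id
    .groupBy(col("DATE"))
    .agg(countDistinct(col("pid")).as("n"))       // partitions touched per DATE
    .agg(max(col("n")))                           // worst case over all dates
    .first()
    .getLong(0)
```

Calling it on the freshly read DataFrame and again on the result of `repartition(col("DATE"))` shows the difference: after the repartition, rows are hash-partitioned by "DATE", so the helper should return 1.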