Spark DataFrames with Parquet and Partitioning
Problem Description
I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a Parquet file that is 10 blocks; Spark will naturally create 10 partitions. But when the DataFrame reads in the file to process it, won't it be processing a large data-to-partition ratio? Because if it were processing the file uncompressed, the block size would have been much larger, meaning the partitions would be larger as well.
So let me clarify: Parquet is compressed (these numbers are not fully accurate). A 1 GB Parquet file = 5 blocks = 5 partitions, which might be decompressed to 5 GB, making it 25 blocks / 25 partitions. But unless you repartition the 1 GB Parquet file, you will be stuck with just 5 partitions when optimally it would be 25 partitions? Or is my logic wrong?
Would it make sense to repartition to increase speed? Or am I thinking about this wrong? Can anyone shed some light on this?
Assumptions:
- 1 block = 1 partition in Spark
- 1 core operates on 1 partition
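To make the scenario above concrete, here is a minimal sketch using the Spark 1.5-era API; the input path and the target count of 25 are hypothetical, taken from the numbers in the question:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local[*]", "parquet-partition-demo")  // or an existing context
val sqlContext = new SQLContext(sc)

// Reading the ~1 GB Parquet file: Spark derives the initial partition count
// from the underlying HDFS blocks, not from the decompressed (~5 GB) size.
val df = sqlContext.read.parquet("hdfs:///data/example.parquet")  // hypothetical path
println(df.rdd.partitions.length)  // e.g. 5, one per HDFS block

// Explicit repartitioning spreads the decompressed data over more cores.
val df25 = df.repartition(25)
println(df25.rdd.partitions.length)  // 25
```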
Recommended Answer
A Spark DataFrame doesn't load Parquet files into memory. It uses the Hadoop/HDFS API to read them during each operation. So the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).
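Both sizes can be inspected from the Hadoop configuration. A small sketch; the property names are the standard Hadoop and Parquet ones, and the 128 MB fallback is an assumption, not a guaranteed default:

```scala
// Inspect the two block sizes the answer distinguishes. `sc` is an existing
// SparkContext; 128 MB is only a fallback value for the lookup.
val hadoopConf = sc.hadoopConfiguration
val hdfsBlockSize = hadoopConf.getLong("dfs.blocksize", 128L * 1024 * 1024)
val parquetBlockSize = hadoopConf.getLong("parquet.block.size", 128L * 1024 * 1024)
println(s"HDFS block size: $hdfsBlockSize bytes, Parquet block size: $parquetBlockSize bytes")
```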
Spark 1.5 DataFrame partitions a Parquet file as follows (a rough numeric sketch follows the list):
- 1 partition per HDFS block
- If the HDFS block size is less than the Parquet block size configured in Spark, one partition will be created for multiple HDFS blocks, such that the total size of a partition is no less than the Parquet block size
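As a back-of-the-envelope illustration of that rule, the effective split size is the larger of the two block sizes. This is a hypothetical helper for intuition only, not a Spark API:

```scala
// Rough model of the partitioning rule above: one partition per HDFS block,
// but partitions are grown until they reach at least the Parquet block size.
def expectedPartitions(fileSize: Long, hdfsBlock: Long, parquetBlock: Long): Long = {
  val effectiveSplit = math.max(hdfsBlock, parquetBlock)
  math.max(1L, (fileSize + effectiveSplit - 1) / effectiveSplit)  // ceiling division
}

// 1 GB file, 128 MB HDFS blocks, 128 MB Parquet blocks -> 8 partitions
println(expectedPartitions(1L << 30, 128L << 20, 128L << 20))
// 1 GB file, 64 MB HDFS blocks, 128 MB Parquet blocks -> still 8 partitions,
// because pairs of 64 MB HDFS blocks are merged to reach the 128 MB minimum
println(expectedPartitions(1L << 30, 64L << 20, 128L << 20))
```

In other words, the compressed size and HDFS layout, not the decompressed size, drive the initial partition count, which is why an explicit repartition() like the one sketched earlier is the way to reach 25 partitions if that parallelism is wanted.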