Spark DataFrames with Parquet and Partitioning

Problem Description

I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a Parquet file that spans 10 blocks; Spark will naturally create 10 partitions. But when the DataFrame reads the file in to process it, won't it be processing a large data-to-partition ratio? If it were processing the file uncompressed, the data would span many more blocks, and hence many more partitions.

So let me clarify: with compressed Parquet (these numbers are not fully accurate), 1 GB of Parquet = 5 blocks = 5 partitions, which might decompress to 5 GB, i.e. 25 blocks / 25 partitions. But unless you repartition the 1 GB Parquet file, you will be stuck with just 5 partitions, when optimally it would be 25 partitions? Or is my logic wrong?

Would it make sense to repartition to increase speed? Or am I thinking about this wrong? Can anyone shed some light on this?

Assumptions:


  • 1 block = 1 partition in Spark

  • 1 core operates on 1 partition
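
For concreteness, a minimal sketch of how the partition count can be checked and changed (the SparkContext `sc`, the path, and the target of 25 partitions are only placeholders for this example):

    import org.apache.spark.sql.SQLContext

    // Assumes an existing SparkContext `sc`; path and numbers are hypothetical.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("/data/events.parquet")

    // Spark creates roughly one partition per HDFS block of the compressed file,
    // e.g. 5 partitions for a 1 GB / 5-block Parquet file.
    println(s"partitions after read: ${df.rdd.partitions.length}")

    // Explicitly repartition if more parallelism is wanted downstream.
    val repartitioned = df.repartition(25)
    println(s"partitions after repartition: ${repartitioned.rdd.partitions.length}")

Note that repartition(25) triggers a full shuffle, so it only pays off when the downstream work per partition is heavy enough to justify it.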

Recommended Answer

A Spark DataFrame doesn't load Parquet files into memory. It uses the Hadoop/HDFS API to read them during each operation, so the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).
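
The two sizes the answer distinguishes both live in the Hadoop configuration that Spark carries around. A quick way to inspect them might look like the sketch below (key names as in stock Hadoop/Parquet; the 128 MB value is only a fallback shown when the key is unset):

    // Sketch: inspect the two different "block sizes" mentioned above.
    val hadoopConf = sc.hadoopConfiguration

    // HDFS block size: how the file is physically split into blocks on HDFS.
    val hdfsBlockSize = Option(hadoopConf.get("dfs.blocksize")).getOrElse("134217728 (default)")

    // Parquet block size: the row-group size Parquet files are written with.
    val parquetBlockSize = Option(hadoopConf.get("parquet.block.size")).getOrElse("134217728 (default)")

    println(s"dfs.blocksize = $hdfsBlockSize, parquet.block.size = $parquetBlockSize")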

Spark 1.5 DataFrames partition a Parquet file as follows:


  • 1 partition per HDFS block

  • If the HDFS block size is smaller than the Parquet block size configured in Spark, a partition will be created for multiple HDFS blocks, so that the total size of the partition is no less than the Parquet block size
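As a rough illustration of the second rule, one common approach (a sketch, not part of the original answer) is to write Parquet with a row-group size that matches the HDFS block size, so that the read-side rules above yield one reasonably sized partition per block. The 128 MB figure and the path below are assumptions:

    // Sketch: align the Parquet row-group size with the HDFS block size when writing,
    // so that "1 partition per HDFS block" produces well-sized partitions on read.
    val blockSize = 128 * 1024 * 1024  // assumed 128 MB HDFS block size

    sc.hadoopConfiguration.setInt("dfs.blocksize", blockSize)
    sc.hadoopConfiguration.setInt("parquet.block.size", blockSize)

    df.write.parquet("/data/events_aligned.parquet")   // `df` and `sqlContext` as in the earlier sketch

    // Reading it back should give roughly one partition per 128 MB HDFS block.
    val reread = sqlContext.read.parquet("/data/events_aligned.parquet")
    println(s"partitions: ${reread.rdd.partitions.length}")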
