Spark DataFrames with Parquet and Partitioning


Question

I have not been able to find much information on this topic. Let's say we use a DataFrame to read a Parquet file that spans 10 blocks; Spark will naturally create 10 partitions. But when the DataFrame reads the file to process it, won't each partition end up holding a disproportionately large amount of data? If the same file were stored uncompressed, it would span many more blocks, which would mean more partitions as well.

To clarify with an example (these numbers are not exact): a 1 GB compressed Parquet file = 5 blocks = 5 partitions. Decompressed, it might be 5 GB, which would be 25 blocks / 25 partitions. So unless you repartition the 1 GB Parquet file, you are stuck with just 5 partitions when the optimum would be 25? Or is my logic wrong?

Would it make sense to repartition to increase speed? Or am I thinking about this the wrong way? Can anyone shed some light on this?

Assumptions:

  • 1 block = 1 Spark partition
  • 1 core operates on 1 partition
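
One way to test these assumptions on a real dataset is to check how many partitions Spark actually created and repartition only if that number is too low for the cluster. The following is a minimal sketch, assuming a modern PySpark `SparkSession` is passed in as `spark` (Spark 1.5 would go through `sqlContext.read.parquet` instead). The helper name `read_with_min_partitions` and its `min_partitions` parameter are hypothetical, introduced here only for illustration.

```python
def read_with_min_partitions(spark, path, min_partitions):
    """Read a Parquet dataset and repartition it if it has too few partitions.

    `spark` is expected to be a pyspark.sql.SparkSession (an assumption of
    this sketch); `path` is any location Spark can read Parquet from.
    """
    df = spark.read.parquet(path)
    current = df.rdd.getNumPartitions()
    if current < min_partitions:
        # repartition() performs a full shuffle, so it only pays off when the
        # extra parallelism outweighs the cost of moving the data.
        df = df.repartition(min_partitions)
    return df
```

With the numbers from the example above, a call might look like `read_with_min_partitions(spark, "hdfs:///data/example.parquet", 25)`, where the path is a placeholder. Whether the shuffle is worth it depends on the workload, so it is worth timing the job with and without it.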

Recommended answer

A Spark DataFrame doesn't load Parquet files into memory. It uses the Hadoop/HDFS API to read them during each operation. So the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).

A Spark 1.5 DataFrame partitions a Parquet file as follows:

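  • 1 partition per HDFS block
  • If the HDFS block size is smaller than the Parquet block size configured in Spark, one partition is created for multiple HDFS blocks, such that the total size of the partition is no less than the Parquet block size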
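
The two rules above can be modeled in a few lines of plain Python. This is only a sketch of the behavior as described in this answer, not Spark's actual split-planning code. The function name `estimate_partitions` and the block sizes used in the examples are assumptions chosen for illustration.

```python
MB = 1024 * 1024


def ceil_div(a, b):
    """Integer ceiling division without going through floats."""
    return -(-a // b)


def estimate_partitions(file_size, hdfs_block_size, parquet_block_size):
    """Rough partition count for one Parquet file under the rules above.

    Illustrative model only: real Spark split planning has more inputs.
    All sizes are in bytes; file_size is the size on disk.
    """
    if file_size <= 0:
        return 0
    n_blocks = ceil_div(file_size, hdfs_block_size)
    # One partition per HDFS block, unless HDFS blocks are smaller than the
    # Parquet block size; then enough HDFS blocks are grouped together for a
    # partition to reach at least one Parquet block in size.
    blocks_per_partition = max(1, ceil_div(parquet_block_size, hdfs_block_size))
    # The last group may hold fewer blocks than the others.
    return ceil_div(n_blocks, blocks_per_partition)


# 1 GB file on disk, 128 MB HDFS blocks, 128 MB Parquet blocks
print(estimate_partitions(1024 * MB, 128 * MB, 128 * MB))  # 8
# Same file with 64 MB HDFS blocks: blocks are grouped in pairs
print(estimate_partitions(1024 * MB, 64 * MB, 128 * MB))   # 8
# Same file with 256 MB HDFS blocks: one partition per HDFS block
print(estimate_partitions(1024 * MB, 256 * MB, 128 * MB))  # 4
```

Note that `file_size` is the on-disk (compressed) size, since that is what HDFS blocks measure. In this model the decompressed size of the data never enters the calculation.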
