Spark DataFrame Repartition and Parquet Partition


Problem Description


  1. I am using repartition on columns to store the data in parquet. But I see that the number of parquet partition files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and parquet partitions?


  2. When I write the data to a parquet partition using RDD repartition and then read the data back from the parquet partition, is there any condition under which the RDD partition numbers will be the same during read and write?


  3. How is bucketing a dataframe using a column id different from repartitioning a dataframe via the same column id?


  4. While considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?

Recommended Answer


There are a couple of things you're asking about here: Partitioning, Bucketing, and Balancing of data.

Partitioning:

  1. Partitioning data is often used for distributing load horizontally; this has performance benefits and helps in organizing data in a logical fashion.
  2. Partitioning tables changes how persisted data is structured, creating subdirectories that reflect the partitioning structure.
  3. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering.


In Spark, this is done by df.write.partitionBy(column*), which groups the data by writing rows with the same partition-column values into the same subdirectory.
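A minimal sketch of a partitioned write, runnable in spark-shell (where spark is predefined); the column names date and country, the sample rows, and the output path are all hypothetical:

```scala
import spark.implicits._

// Hypothetical sample data; any DataFrame with suitable columns works.
val sales = Seq(
  ("2023-01-01", "US", 100),
  ("2023-01-01", "DE", 200),
  ("2023-01-02", "US", 300)
).toDF("date", "country", "amount")

// Creates one sub-directory per distinct partition value, e.g.
//   /tmp/sales/date=2023-01-01/country=US/part-*.parquet
sales.write
  .partitionBy("date", "country")
  .parquet("/tmp/sales")
```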

Bucketing:

  1. Bucketing is another technique for decomposing data sets into more manageable parts. Based on the columns provided, the entire data is hashed into a user-defined number of buckets (files).
  2. Synonymous with Hive's Distribute By


In Spark, this is done by df.write.bucketBy(n, column*), which groups the data by hashing the bucketing columns so that rows with the same values land in the same file. The number of files generated is controlled by n.
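A minimal sketch, again for spark-shell; the id column, bucket count, and table name are hypothetical. Note that bucketBy must be paired with saveAsTable — a plain path-based .parquet(path) write does not support bucketing:

```scala
import spark.implicits._

// Hypothetical DataFrame keyed by "id".
val events = (1 to 1000).map(i => (i, s"name_$i")).toDF("id", "name")

// Hash rows into 8 buckets by "id". Rows with the same id always land
// in the same bucket, which later joins and aggregations can exploit.
events.write
  .bucketBy(8, "id")
  .sortBy("id")                  // optional: sort rows within each bucket
  .saveAsTable("events_bucketed")
```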

Repartitioning:

  1. It returns a new DataFrame, balanced evenly based on the given partitioning expressions, with the given number of internal partitions. The resulting DataFrame is hash partitioned.
  2. Spark manages the data in these partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors.


In Spark, this is done by df.repartition(n, column*), which groups the data by the partitioning columns into the same internal partition. Note that no data is persisted to storage; this is just an internal rebalancing of the data, based on constraints similar to bucketBy.
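A minimal sketch of in-memory repartitioning, with a hypothetical key column:

```scala
import spark.implicits._

val df = (1 to 1000).map(i => (i % 10, i)).toDF("key", "value")

// Hash-partition the DataFrame into 4 in-memory partitions by "key".
// Nothing is written to storage; data is only redistributed across
// executors.
val balanced = df.repartition(4, $"key")

println(balanced.rdd.getNumPartitions)   // 4
```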

TL;DR

1) I am using repartition on columns to store the data in parquet. But I see that the number of parquet partition files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and parquet partitions?

  • repartition has a correlation to bucketBy, not partitionBy. The number of partitioned files is governed by other configs such as spark.sql.shuffle.partitions and spark.default.parallelism; see the sketch below.
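A small sketch of how those configs come into play; the values are arbitrary, and in Spark 3.x adaptive query execution may coalesce the resulting partitions, so treat the printed count as indicative:

```scala
import spark.implicits._

val df = (1 to 1000).map(i => (i % 10, i)).toDF("key", "value")

// spark.sql.shuffle.partitions controls how many partitions a shuffle
// produces (joins, aggregations, and repartition-by-column without an
// explicit count). It defaults to 200.
spark.conf.set("spark.sql.shuffle.partitions", "64")

val shuffled = df.repartition($"key")
println(shuffled.rdd.getNumPartitions)   // 64 here (AQE may adjust it)
```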


2) When I write the data to a parquet partition using RDD repartition and then read the data back from the parquet partition, is there any condition under which the RDD partition numbers will be the same during read and write?

  • During the read, the number of partitions will be equal to spark.default.parallelism; a quick way to check is sketched below.
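A way to verify this, reading back the hypothetical path from the earlier partitioning sketch; note that for file-based reads the count also depends on the file layout and settings such as spark.sql.files.maxPartitionBytes:

```scala
// Read the partitioned parquet data back and inspect the partition count.
val readBack = spark.read.parquet("/tmp/sales")
println(readBack.rdd.getNumPartitions)
```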

3) How is bucketing a dataframe using a column id different from repartitioning a dataframe via the same column id?

  • They work similarly, except that bucketing is a write operation and is used for persistence.

4) While considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?

  • repartition of both datasets happens in memory; if one or both of the datasets are persisted, then look into bucketBy as well (see the sketch below).
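A sketch of the bucketed-join case, with hypothetical table names and the broadcast threshold deliberately disabled so the shuffle-free sort-merge join is visible in the plan on small demo data:

```scala
import spark.implicits._

val left  = (1 to 1000).map(i => (i, s"L$i")).toDF("id", "l")
val right = (1 to 1000).map(i => (i, s"R$i")).toDF("id", "r")

// Persist both sides bucketed on the join key with the same bucket count.
left.write.bucketBy(16, "id").saveAsTable("t_left")
right.write.bucketBy(16, "id").saveAsTable("t_right")

// Disable broadcast joins so the effect is visible on this small data.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Because both tables are bucketed identically on the join key, the
// plan should show no Exchange (shuffle) on either side of the join.
val joined = spark.table("t_left").join(spark.table("t_right"), "id")
joined.explain()
```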
