Why does the repartition() method increase file size on disk?


Problem Description

A data lake I am working with (df) has 2 TB of data spread across 20,000 files. I'd like to compact the data set into 2,000 files of roughly 1 GB each.

If I run df.coalesce(2000) and write out to disk, the data lake contains 1.9 TB of data.

If I run df.repartition(2000) and write out to disk, the data lake contains 2.6 TB of data.

Each file in the repartition() data lake is exactly 0.3 GB larger than expected (they're all 1.3 GB files instead of 1 GB files).

Why does the repartition() method increase the size of the overall data lake?
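
Concretely, the two runs being compared look like this (the output paths are hypothetical; Parquet is assumed as the on-disk format):

// Coalesce to 2,000 partitions and write out - yields ~1.9 TB on disk.
df.coalesce(2000).write.parquet("s3a://my-bucket/lake-coalesced")

// Repartition to 2,000 partitions and write out - yields ~2.6 TB on disk.
df.repartition(2000).write.parquet("s3a://my-bucket/lake-repartitioned")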

There is a related question that discusses why the size of a data lake increases after aggregations are run. The answer says:

In general, columnar storage formats like Parquet are highly sensitive to data distribution (data organization) and the cardinality of individual columns. The more organized the data and the lower the cardinality, the more efficient the storage.

Is the coalesce() algorithm providing data that's more organized? I don't think so...

I don't think the other question answers my question.

Recommended Answer

Disclaimer:

This answer contains primarily speculation. A detailed explanation of this phenomenon might require in-depth analysis of the input and output (or at least of their respective metadata).

Observations:

  1. Entropy effectively bounds the performance of the strongest lossless compression possible (see Wikipedia - Entropy (information theory)).
  2. Both persistent columnar formats and the internal Spark SQL representation transparently apply different compression techniques (like run-length encoding or dictionary encoding) to reduce the memory footprint of the stored data. Additionally, on-disk formats (including plain-text data) can be explicitly compressed using general-purpose compression algorithms; it is not clear whether that is the case here.
  3. Compression (explicit or transparent) is applied to blocks of data (typically partitions, but smaller units can be used).

Based on 1), 2) and 3), we can assume that the average compression rate will depend on the distribution of the data in the cluster. We should also note that the final result can be non-deterministic if the upstream lineage contains wide transformations.
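
A minimal sketch of how one could observe this locally, assuming a low-cardinality column and comparing the on-disk sizes of the two layouts (paths and names are illustrative, not from the original question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[*]").getOrCreate()

// One million rows, but only 10 distinct values - very low entropy.
val lowCard = spark.range(1000000L).select((col("id") % 10).as("k"))

// Sorted layout: long runs of identical values per file - compresses well.
lowCard.sort("k").write.mode("overwrite").parquet("/tmp/organized")

// Shuffled layout: values interleaved across files - compresses worse.
lowCard.repartition(10).write.mode("overwrite").parquet("/tmp/interleaved")

// Compare the directory sizes (e.g. with `du -sh`) to see the difference.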

Possible impact of coalesce vs. repartition:

In general, coalesce can take one of two paths:

  • Escalate through the pipeline up to the source - the most common case.
  • Propagate to the nearest shuffle.

In the first case we can expect that the compression rate will be comparable to the compression rate of the input. However, there are some cases where it can achieve a much smaller final output. Let's imagine a degenerate dataset:

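// Six single-row partitions: each partition holds exactly one value.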
val df = sc.parallelize(
  Seq("foo", "foo", "foo", "bar", "bar", "bar"),
  6 
).toDF

If a dataset like this were written to disk, there would be no potential for compression - each value would have to be written as-is:

df.withColumn("pid", spark_partition_id()).show

+-----+---+
|value|pid|
+-----+---+
|  foo|  0|
|  foo|  1|
|  foo|  2|
|  bar|  3|
|  bar|  4|
|  bar|  5|
+-----+---+

In other words, we need roughly 6 * 3 bytes, giving 18 bytes in total.

But if we coalesce:

df.coalesce(2).withColumn("pid", spark_partition_id()).show

+-----+---+
|value|pid|
+-----+---+
|  foo|  0|
|  foo|  0|
|  foo|  0|
|  bar|  1|
|  bar|  1|
|  bar|  1|
+-----+---+

we can, for example, apply RLE with a small int as the count, and store each partition in 3 + 1 bytes, giving 8 bytes in total.

This is of course a huge oversimplification, but it shows how preserving a low-entropy input structure and merging blocks can result in a lower memory footprint.
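
As a toy sketch of that encoding argument (plain Scala, not Parquet's actual RLE implementation):

// Collapse consecutive duplicates into (value, count) pairs - the essence of RLE.
def rle[T](xs: Seq[T]): List[(T, Int)] =
  xs.foldLeft(List.empty[(T, Int)]) {
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
    case (acc, x)                      => (x, 1) :: acc
  }.reverse

rle(Seq("foo", "foo", "foo"))        // List((foo,3)) - one value plus one count
rle(Seq("foo", "bar", "foo", "bar")) // List((foo,1), (bar,1), (foo,1), (bar,1)) - no savings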

The second coalesce scenario is less obvious, but there are scenarios where entropy can be reduced by an upstream process (think, for example, of window functions), and preserving such structure will be beneficial.

What about repartition?

Without a partitioning expression, repartition applies RoundRobinPartitioning (implemented as HashPartitioning with a pseudo-random key based on the partition id). As long as the hash function behaves sensibly, such a redistribution should maximize the entropy of the data and, as a result, decrease the achievable compression rate.
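
Continuing the toy example above, a round-robin repartition undoes the neat grouping that coalesce preserved (a sketch; exact row placement can vary between Spark versions):

// Each output partition now tends to mix "foo" and "bar" rows,
// so run-length or dictionary encoding has less to work with.
df.repartition(2).withColumn("pid", spark_partition_id()).show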

Conclusion:

coalesce alone shouldn't provide any specific benefit, but it can preserve existing properties of the data distribution - a property that can be advantageous in certain cases.

repartition, due to its nature, will on average make things worse, unless the entropy of the data is already maximized (a scenario where things improve is possible, but highly unlikely on a non-trivial dataset).

Finally, repartition with a partitioning expression, or repartitionByRange, should decrease entropy and improve compression rates.
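
In code, the entropy-reducing variants look like this (the column name "key" and the output paths are hypothetical; assumes import org.apache.spark.sql.functions.col):

// Hash-partition on a column: identical key values land in the same partition.
df.repartition(2000, col("key")).write.parquet("/path/hash-partitioned")

// Range-partition on a column: each partition holds a contiguous key range.
df.repartitionByRange(2000, col("key")).write.parquet("/path/range-partitioned")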

Note:

We should also keep in mind that columnar formats usually decide on a specific compression/encoding method (or the lack thereof) based on runtime statistics. So even if the set of rows in a particular block is fixed but the order of the rows changes, we can observe different outcomes.
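
One practical corollary: re-sorting rows inside each output partition before writing can restore a compression-friendly order even after a repartition (a sketch; the sort column "key" is again hypothetical):

// Same set of rows per partition, but ordered - often encodes considerably better.
df.repartition(2000, col("key"))
  .sortWithinPartitions("key")
  .write.parquet("/path/sorted-within-partitions")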
