Compute size of Spark dataframe - SizeEstimator gives unexpected results

Problem Description

I am trying to find a reliable way to compute the size (in bytes) of a Spark dataframe programmatically.

The reason is that I would like to have a method to compute an "optimal" number of partitions ("optimal" could mean different things here: it could mean having an optimal partition size, or resulting in an optimal file size when writing to Parquet tables - but both can be assumed to be some linear function of the dataframe size). In other words, I would like to call coalesce(n) or repartition(n) on the dataframe, where n is not a fixed number but rather a function of the dataframe size.
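
To make that last point concrete, here is a minimal sketch of the idea, assuming a size estimate in bytes is already available; the helper name repartitionBySize and the 128MB target are my own illustrative choices, not part of the original question:

import org.apache.spark.sql.DataFrame

// Hypothetical sketch: derive a partition count from an (assumed) size estimate
// in bytes and a target partition size, then repartition accordingly.
def repartitionBySize(df: DataFrame,
                      sizeInBytes: Long,
                      targetPartitionBytes: Long = 128L * 1024 * 1024): DataFrame = {
  val n = math.max(1, math.ceil(sizeInBytes.toDouble / targetPartitionBytes).toInt)
  df.repartition(n)
}

The open question is where a reliable value for sizeInBytes should come from, which is what the rest of this post is about.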

Other topics on SO suggest using SizeEstimator.estimate from org.apache.spark.util to get the size in bytes of the dataframe, but the results I'm getting are inconsistent.

First of all, I'm persisting my dataframe to memory:

df.cache().count 

The Spark UI shows a size of 4.8GB in the Storage tab. Then, I run the following command to get the size from SizeEstimator:

import org.apache.spark.util.SizeEstimator
SizeEstimator.estimate(df)

This gives a result of 115'715'808 bytes =~ 116MB. However, applying SizeEstimator to different objects leads to very different results. For instance, I try computing the size separately for each row in the dataframe and sum them:

df.map(row => SizeEstimator.estimate(row.asInstanceOf[AnyRef])).reduce(_+_)

This results in a size of 12'084'698'256 bytes =~ 12GB. Or, I can try to apply SizeEstimator to every partition:

df.mapPartitions(
    iterator => Seq(SizeEstimator.estimate(
        iterator.toList.map(row => row.asInstanceOf[AnyRef]))).toIterator
).reduce(_+_)

which results again in a different size of 10'792'965'376 bytes =~ 10.8GB.

I understand there are memory optimizations / memory overhead involved, but after performing these tests I don't see how SizeEstimator can be used to get a sufficiently good estimate of the dataframe size (and consequently of the partition size, or resulting Parquet file sizes).

What is the appropriate way (if any) to apply SizeEstimator in order to get a good estimate of a dataframe size or of its partitions? If there isn't any, what is the suggested approach here?

Answer

Unfortunately, I was not able to get reliable estimates from SizeEstimator, but I could find another strategy - if the dataframe is cached, we can extract its size from queryExecution as follows:

df.cache.foreach(_ => ())
val catalyst_plan = df.queryExecution.logical
val df_size_in_bytes = spark.sessionState.executePlan(
    catalyst_plan).optimizedPlan.stats.sizeInBytes

For the example dataframe, this gives exactly 4.8GB (which also corresponds to the file size when writing to an uncompressed Parquet table).

This has the disadvantage that the dataframe needs to be cached, but it is not a problem in my case.
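
Putting the two pieces together, here is a rough wrapper (my own sketch, not part of the original answer) that reads the size from the optimized plan statistics as above and uses it to pick a partition count; the 128MB target is again just an example value:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: read the size of a cached dataframe from the optimized plan statistics
// (same approach as the snippet above) and use it to choose a partition count.
def repartitionToTargetSize(spark: SparkSession, df: DataFrame,
                            targetPartitionBytes: Long = 128L * 1024 * 1024): DataFrame = {
  df.cache.foreach(_ => ())  // materialize the cache first
  val catalystPlan = df.queryExecution.logical
  val sizeInBytes = spark.sessionState.executePlan(
      catalystPlan).optimizedPlan.stats.sizeInBytes  // BigInt
  val n = math.max(1, math.ceil(sizeInBytes.toDouble / targetPartitionBytes).toInt)
  df.repartition(n)
}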
