How can I estimate the size in bytes of each column in a Spark DataFrame?

Problem Description

I have a very large Spark DataFrame with a number of columns, and I want to make an informed judgement about whether or not to keep them in my pipeline, in part based on how big they are. By "how big," I mean the size in bytes in RAM when this DataFrame is cached, which I expect to be a decent estimate for the computational cost of processing this data. Some columns are simple types (e.g. doubles, integers) but others are complex types (e.g. arrays and maps of variable length).

An approach I have tried is to cache the DataFrame without and then with the column in question, check out the Storage tab in the Spark UI, and take the difference. But this is an annoying and slow exercise for a DataFrame with a lot of columns.

I typically use PySpark so a PySpark answer would be preferable, but Scala would be fine as well.

Recommended Answer

I found a solution which builds off of this related answer: https://stackoverflow.com/a/49529028.

Assuming I'm working with a dataframe called df and a SparkSession object called spark:

import org.apache.spark.sql.{functions => F}

// force the full dataframe into memory (could specify persistence
// mechanism here to ensure that it's really being cached in RAM)
df.cache()
df.count()

// calculate size of full dataframe
val catalystPlan = df.queryExecution.logical
val dfSizeBytes = spark.sessionState.executePlan(catalystPlan).optimizedPlan.stats.sizeInBytes

for (col <- df.columns) {
    println("Working on " + col)

    // select all columns except this one:
    val subDf = df.select(df.columns.filter(_ != col).map(F.col): _*)

    // force subDf into RAM
    subDf.cache()
    subDf.count()

    // calculate size of subDf
    val catalystPlan = subDf.queryExecution.logical
    val subDfSizeBytes = spark.sessionState.executePlan(catalystPlan).optimizedPlan.stats.sizeInBytes

    // size of this column as a fraction of full dataframe
    val colSizeFrac = (dfSizeBytes - subDfSizeBytes).toDouble / dfSizeBytes.toDouble
    println("Column space fraction is " + colSizeFrac * 100.0 + "%")
    subDf.unpersist()
}

Some confirmations that this approach gives sensible results:

  1. The reported column sizes add up to 100%.
  2. Columns of simple types (e.g. integers or doubles) take up the expected 4 or 8 bytes per row.
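
Since the question mentions that a PySpark answer would be preferable: below is a minimal PySpark sketch of the same idea, assuming the same df and spark as above. It goes through the private py4j handles _jdf and _jsparkSession and the one-argument executePlan call used in the Scala snippet; these are internal APIs, so the exact calls may differ between Spark versions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def estimated_size_bytes(spark, df):
    # Re-plan the DataFrame's logical plan so the optimizer picks up the
    # cached InMemoryRelation and reports its in-memory statistics.
    # NOTE: _jdf, _jsparkSession and the one-argument executePlan call are
    # internal APIs (an assumption here); they can change between Spark versions.
    catalyst_plan = df._jdf.queryExecution().logical()
    query_execution = spark._jsparkSession.sessionState().executePlan(catalyst_plan)
    return int(query_execution.optimizedPlan().stats().sizeInBytes().toString())

# force the full dataframe into memory, then estimate its cached size
df.cache()
df.count()
df_size_bytes = estimated_size_bytes(spark, df)

for column in df.columns:
    print("Working on " + column)

    # select all columns except this one and force the result into RAM
    sub_df = df.select([c for c in df.columns if c != column])
    sub_df.cache()
    sub_df.count()

    sub_df_size_bytes = estimated_size_bytes(spark, sub_df)
    sub_df.unpersist()

    # size of this column as a fraction of the full dataframe
    col_size_frac = (df_size_bytes - sub_df_size_bytes) / float(df_size_bytes)
    print("Column space fraction is " + str(col_size_frac * 100.0) + "%")

As in the Scala version, the reported numbers reflect the in-memory representation only after the count() has forced the cache to materialize.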
