How to get the size of a data frame before doing the broadcast join in pyspark


Question

I am new to Spark, and I want to do a broadcast join. Before that, I am trying to get the size of the data frame that I want to broadcast.

Is there any way to find the size of a data frame?

I am using Python as my programming language for Spark.

Any help is much appreciated.

Answer

If you are looking for the size in bytes as well as the size in row count, follow this:

    // ### Alternative 1
    import org.apache.spark.sql.functions.col
    /**
      * file content
      * spark-test-data.json
      * --------------------
      * {"id":1,"name":"abc1"}
      * {"id":2,"name":"abc2"}
      * {"id":3,"name":"abc3"}
      */
    val fileName = "spark-test-data.json"
    val path = getClass.getResource("/" + fileName).getPath

    spark.catalog.createTable("df", path, "json")
      .show(false)

    /**
      * +---+----+
      * |id |name|
      * +---+----+
      * |1  |abc1|
      * |2  |abc2|
      * |3  |abc3|
      * +---+----+
      */
    // Collect only statistics that do not require scanning the whole table (that is, size in bytes).
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
    spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

    /**
      * +----------+---------+-------+
      * |col_name  |data_type|comment|
      * +----------+---------+-------+
      * |Statistics|68 bytes |       |
      * +----------+---------+-------+
      */
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
    spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

    /**
      * +----------+----------------+-------+
      * |col_name  |data_type       |comment|
      * +----------+----------------+-------+
      * |Statistics|68 bytes, 3 rows|       |
      * +----------+----------------+-------+
      */
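
The code in the answer is Scala. Since the question asks for Python, here is a minimal PySpark sketch of the same idea, assuming an existing SparkSession named spark and a sample file "spark-test-data.json" on a local path (adjust the path for your environment); the ANALYZE TABLE / DESCRIBE EXTENDED statements themselves are plain Spark SQL and work the same from PySpark.

    # Minimal PySpark sketch of Alternative 1 (assumed names: `spark`, "spark-test-data.json").
    from pyspark.sql.functions import col

    # Register the JSON file as a table so its statistics can be computed and stored.
    spark.catalog.createTable("df", path="spark-test-data.json", source="json")

    # Collect only statistics that do not require scanning the whole table (size in bytes).
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
    spark.sql("DESCRIBE EXTENDED df").filter(col("col_name") == "Statistics").show(truncate=False)

    # Scan the table to also collect the row count.
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
    spark.sql("DESCRIBE EXTENDED df").filter(col("col_name") == "Statistics").show(truncate=False)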

Alternative 2


    // ### Alternative 2

    val df = spark.range(10)
    df.createOrReplaceTempView("myView")
    spark.sql("explain cost select * from myView").show(false)

    /**
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |plan                                                                                                                                                                    |
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      * |== Optimized Logical Plan ==
      * Range (0, 10, step=1, splits=Some(2)), Statistics(sizeInBytes=80.0 B, hints=none)
      *
      * == Physical Plan ==
      * *(1) Range (0, 10, step=1, splits=2)|
      * +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      */
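
Alternative 2 translates to PySpark almost verbatim; a minimal sketch, again assuming an existing SparkSession named spark:

    # Minimal PySpark sketch of Alternative 2 (assumes an existing SparkSession `spark`).
    df = spark.range(10)
    df.createOrReplaceTempView("myView")

    # EXPLAIN COST prints the optimized logical plan together with its statistics,
    # including sizeInBytes.
    spark.sql("explain cost select * from myView").show(truncate=False)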

Alternative 3

    // ### Alternative 3
    println(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytes)
    // 80
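
Alternative 3 relies on Spark's internal planner APIs, which are not exposed in PySpark. A commonly used workaround is to go through the underlying Java DataFrame via df._jdf; this is a minimal sketch only, assuming Spark 2.4+ (where LogicalPlan.stats takes no arguments), and note that _jdf and queryExecution are internal, unsupported APIs that may change between versions.

    # Minimal PySpark sketch of Alternative 3 (internal APIs; assumes Spark 2.4+).
    df = spark.range(10)
    size_in_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
    print(size_in_bytes)  # e.g. 80

Once you have an estimate of the size, you can compare it against spark.sql.autoBroadcastJoinThreshold (10 MB by default), or force the broadcast explicitly by wrapping the small side with pyspark.sql.functions.broadcast before the join.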
