How to reduce multiple joins in Spark


Problem Description

I am using Spark 2.4.1 to figure out some ratios on my data frame.

I need to find different ratio factors for different columns in a given data frame (df_data) by joining it to a meta dataframe (i.e. resDs).

I am getting these ratio factors (i.e. ratio_1_factor, ratio_2_factor & ratio_3_factor) by using three different joins with different join conditions, i.e. joinedDs, joinedDs2, joinedDs3.

Is there any alternative to reduce the number of joins and make this work optimally?

You can find the entire sample data at the public URL below.

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1165111237342523/3521103084252405/7035720262824085/latest.html

How should multiple steps, rather than a single step, be handled in the when clause:

.withColumn("step_1_ratio_1", (col("ratio_1").minus(lit(0.00000123))).cast(DataTypes.DoubleType)) // step-2
.withColumn("step_2_ratio_1", (col("step_1_ratio_1").multiply(lit(0.02))).cast(DataTypes.DoubleType)) // step-3
.withColumn("step_3_ratio_1", (col("step_2_ratio_1").divide(col("step_1_ratio_1"))).cast(DataTypes.DoubleType)) // step-4
.withColumn("ratio_1_factor", (col("ratio_1_factor")).cast(DataTypes.DoubleType)) // step-5

i.e. "ratio_1_factor" is calculated based on various other columns in the dataframe df_data.

These steps (2, 3, 4) are also used in the other ratio factor calculations, i.e. ratio_2_factor and ratio_3_factor. How should this be handled?
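One way to avoid repeating steps 2-4 for every factor is to put the shared arithmetic in one helper and apply it to each ratio. The sketch below models the column arithmetic in plain Scala (no Spark session needed); the constants 0.00000123 and 0.02 come from the snippet above, while the helper name `ratioFactorSteps` is hypothetical, not from the original code:

```scala
// Plain-Scala model of steps 2-4 from the question, shared by all ratio factors.
// ratioFactorSteps is a hypothetical name introduced for this sketch.
def ratioFactorSteps(ratio: Double): Double = {
  val step1 = ratio - 0.00000123 // step-2: subtract the small constant
  val step2 = step1 * 0.02       // step-3: scale by 0.02
  step2 / step1                  // step-4: divide step-3 by step-2's input
}

// The same helper can back ratio_1_factor, ratio_2_factor and ratio_3_factor:
val factors = Seq(0.0456646, 0.046899415, 0.046000415).map(ratioFactorSteps)
```

In Spark itself the same idea applies: build the shared steps once as a `Column` expression (or a `def` returning a `Column`) and reuse it in each `withColumn`/`agg` call, so steps 2-4 are written only once.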

Recommended Answer

You can join once and calculate the ratio_1_factor, ratio_2_factor and ratio_3_factor columns using the max and when functions in the aggregation:

import org.apache.spark.sql.functions.{broadcast, col, lit, max, when}
import org.apache.spark.sql.types.DoubleType

val joinedDs = df_data.as("aa")
  .join(
    broadcast(resDs.as("bb")),
    col("aa.g_date").between(col("bb.start_date"), col("bb.end_date"))
  )
  .groupBy("item_id", "g_date", "ratio_1", "ratio_2", "ratio_3")
  .agg(
    max(when(
        col("aa.ratio_1").between(col("bb.A"), col("bb.A_lead")),
        col("ratio_1").multiply(lit(0.1))
      )
    ).cast(DoubleType).as("ratio_1_factor"),
    max(when(
        col("aa.ratio_2").between(col("bb.A"), col("bb.A_lead")),
        col("ratio_2").multiply(lit(0.2))
      )
    ).cast(DoubleType).as("ratio_2_factor"),
    max(when(
        col("aa.ratio_3").between(col("bb.A"), col("bb.A_lead")),
        col("ratio_3").multiply(lit(0.3))
      )
    ).cast(DoubleType).as("ratio_3_factor")
  )


joinedDs.show(false)

//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+
//|item_id|g_date    |ratio_1  |ratio_2    |ratio_3    |ratio_1_factor       |ratio_2_factor|ratio_3_factor|
//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+
//|50312  |2016-01-04|0.0456646|0.046899415|0.046000415|0.0045664600000000005|0.009379883   |0.0138001245  |
//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+
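The key trick above is conditional aggregation: `max(when(cond, value))` keeps, per group, only the values from rows matching the condition (non-matching rows yield null, which `max` ignores), so a single join can serve three differently-conditioned factors. A plain-Scala sketch of that pattern, with made-up sample values, not taken from the dataset:

```scala
// Plain-Scala analogue of max(when(cond, value)) within one group.
// Each tuple is (conditionHolds, value); rows failing the condition
// contribute nothing, like when(...) producing null under max().
def maxWhen(rows: Seq[(Boolean, Double)]): Option[Double] =
  rows.collect { case (true, v) => v }.reduceOption(_ max _)

// Hypothetical group: only the rows whose condition held are aggregated.
val group = Seq((false, 1.0), (true, 0.3), (true, 0.7))
```

If no row in a group matches a factor's condition, the result is null (here `None`), which matches the behavior of `max(when(...))` without an `otherwise` clause.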
