How to reduce multiple joins in Spark
Question
I am using Spark 2.4.1 to figure out some ratios on my data frame.

I need to find different ratio factors for different columns in a given data frame (df_data) by joining it to a meta dataframe (resDs).
I am getting these ratio factors (i.e. ratio_1_factor, ratio_2_factor & ratio_3_factor) by using three different joins with different join conditions, i.e. joinedDs, joinedDs2 and joinedDs3.
Is there any alternative that reduces the number of joins and makes this work optimally?
You can find the entire sample data at the public URL below.
How should multiple steps (rather than a single step) be handled in the when clause:
.withColumn("step_1_ratio_1", (col("ratio_1").minus(lit(0.00000123))).cast(DataTypes.DoubleType)) // step-2
.withColumn("step_2_ratio_1", (col("step_1_ratio_1").multiply(lit(0.02))).cast(DataTypes.DoubleType)) //step-3
.withColumn("step_3_ratio_1", (col("step_2_ratio_1").divide(col("step_1_ratio_1"))).cast(DataTypes.DoubleType)) //step-4
.withColumn("ratio_1_factor", (col("ratio_1_factor")).cast(DataTypes.DoubleType)) //step-5
i.e. "ratio_1_factor" is calculated based on various other columns in the dataframe df_data.
These steps (2, 3, 4) are used in the other ratio factor calculations too, i.e. ratio_2_factor and ratio_3_factor. How should this be handled?
Answer
You can join once and calculate the ratio_1_factor, ratio_2_factor and ratio_3_factor columns using the max and when functions in the aggregation:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DoubleType

val joinedDs = df_data.as("aa")
  .join(
    broadcast(resDs.as("bb")),
    col("aa.g_date").between(col("bb.start_date"), col("bb.end_date"))
  )
  .groupBy("item_id", "g_date", "ratio_1", "ratio_2", "ratio_3")
  .agg(
    max(when(
      col("aa.ratio_1").between(col("bb.A"), col("bb.A_lead")),
      col("ratio_1").multiply(lit(0.1))
    )).cast(DoubleType).as("ratio_1_factor"),
    max(when(
      col("aa.ratio_2").between(col("bb.A"), col("bb.A_lead")),
      col("ratio_2").multiply(lit(0.2))
    )).cast(DoubleType).as("ratio_2_factor"),
    max(when(
      col("aa.ratio_3").between(col("bb.A"), col("bb.A_lead")),
      col("ratio_3").multiply(lit(0.3))
    )).cast(DoubleType).as("ratio_3_factor")
  )
joinedDs.show(false)
//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+
//|item_id|g_date |ratio_1 |ratio_2 |ratio_3 |ratio_1_factor |ratio_2_factor|ratio_3_factor|
//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+
//|50312 |2016-01-04|0.0456646|0.046899415|0.046000415|0.0045664600000000005|0.009379883 |0.0138001245 |
//+-------+----------+---------+-----------+-----------+---------------------+--------------+--------------+
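The trick here can be pictured without Spark: instead of three separate joins (one per condition), a single pass over the joined rows evaluates every when() condition, rows that fail contribute nothing (like when() yielding null), and max() keeps the value from the matching band. A hypothetical in-memory analogue (Band and factors are illustrative names, not part of the answer):

```scala
// Illustrative in-memory analogue of one join plus conditional aggregation.
case class Band(a: Double, aLead: Double)

// For each (ratio, multiplier) pair, scan all bands once; non-matching bands
// contribute nothing (like when() -> null), and max keeps the matching value.
def factors(ratios: Seq[Double], bands: Seq[Band], mults: Seq[Double]): Seq[Option[Double]] =
  ratios.zip(mults).map { case (r, m) =>
    bands.collect { case b if r >= b.a && r <= b.aLead => r * m }
         .reduceOption(_ max _)
  }
```

This also shows why a max aggregation is safe: at most one band matches a given ratio, so max simply picks the single non-null result per group.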