How to include/map calculated percentiles to the result dataframe?
Problem description
I'm using spark-sql 2.4.1 and I'm trying to find quantiles (percentile 0, percentile 25, etc.) for each column of my data.
Since I am computing multiple percentiles, how do I retrieve each calculated percentile from the results?
My dataframe df:
+----+---------+-------------+----------+-----------+
| id| date| revenue|con_dist_1| con_dist_2|
+----+---------+-------------+----------+-----------+
| 10|1/15/2018| 0.010680705| 6|0.019875458|
| 10|1/15/2018| 0.006628853| 4|0.816039063|
| 10|1/15/2018| 0.01378215| 4|0.082049528|
| 10|1/15/2018| 0.010680705| 6|0.019875458|
| 10|1/15/2018| 0.006628853| 4|0.816039063|
+----+---------+-------------+----------+-----------+
I need to get the following expected output:
+----+---------+-------------+-------------+------------+-------------+
| id| date| revenue| perctile_col| quantile_0 |quantile_10 |
+----+---------+-------------+-------------+------------+-------------+
| 10|1/15/2018| 0.010680705| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.01378215| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.01378215| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.010680705| con_dist_2 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_1 |<quant0_val>|<quant10_val>|
| 10|1/15/2018| 0.006628853| con_dist_2 |<quant0_val>|<quant10_val>|
+----+---------+-------------+-------------+------------+-------------+
I have already calculated the quantiles like this, but I need to add them to the output dataframe:
val col_list = Array("con_dist_1", "con_dist_2")
val quantiles = df.stat.approxQuantile(col_list, Array(0.0, 0.1, 0.5), 0.0)
// Indices into the probabilities array above: 0 -> 0.0, 1 -> 0.1
val percentile_0 = 0
val percentile_10 = 1
val Q0 = quantiles(col_list.indexOf("con_dist_1"))(percentile_0)
val Q10 = quantiles(col_list.indexOf("con_dist_1"))(percentile_10)
How do I get the expected output shown above?
Recommended answer
An easy solution is to create multiple dataframes, one for each "con_dist" column, and then use union to merge them together. approxQuantile returns one inner array per column, in the order of col_list, and each inner array holds the values in the order of the requested probabilities, so zipWithIndex gives the position needed to look up each column's quantiles. This can be done with a map over col_list as follows:
import org.apache.spark.sql.functions.lit

val col_list = Array("con_dist_1", "con_dist_2")
val quantiles = df.stat.approxQuantile(col_list, Array(0.0, 0.1, 0.5), 0.0)
// Positions of the 0th and 10th percentiles in the probabilities array above
val percentile_0 = 0
val percentile_10 = 1
val df2 = df.drop(col_list: _*) // we don't need these columns anymore
val result = col_list
  .zipWithIndex
  .map { case (col, colIndex) =>
    val Q0 = quantiles(colIndex)(percentile_0)
    val Q10 = quantiles(colIndex)(percentile_10)
    // Tag every row with the source column name and its quantile values
    df2.withColumn("perctile_col", lit(col))
      .withColumn("quantile_0", lit(Q0))
      .withColumn("quantile_10", lit(Q10))
  }.reduce(_.union(_))
The final dataframe will be:
+---+---------+-----------+------------+-----------+-----------+
| id| date| revenue|perctile_col| quantile_0|quantile_10|
+---+---------+-----------+------------+-----------+-----------+
| 10|1/15/2018|0.010680705| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.006628853| con_dist_1| 4.0| 4.0|
| 10|1/15/2018| 0.01378215| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.010680705| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.006628853| con_dist_1| 4.0| 4.0|
| 10|1/15/2018|0.010680705| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018|0.006628853| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018| 0.01378215| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018|0.010680705| con_dist_2|0.019875458|0.019875458|
| 10|1/15/2018|0.006628853| con_dist_2|0.019875458|0.019875458|
+---+---------+-----------+------------+-----------+-----------+
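As a possible alternative to building one dataframe per column and unioning them, the quantiles can be put into a small lookup dataframe (one row per source column) and cross-joined with the remaining columns. The following is only a sketch under assumptions: the object name QuantileCrossJoin, the helper attachQuantiles, and the local SparkSession are illustrative and not part of the original answer. The row order of the result may also differ from the table above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, named here for illustration only.
object QuantileCrossJoin {

  // Returns df without the `cols` columns, repeated once per entry in `cols`,
  // with that column's 0th and 10th percentile attached.
  def attachQuantiles(df: DataFrame, cols: Array[String]): DataFrame = {
    import df.sparkSession.implicits._

    // Outer array: one entry per column in `cols`.
    // Inner array: one value per requested probability, in the same order.
    val quantiles = df.stat.approxQuantile(cols, Array(0.0, 0.1, 0.5), 0.0)

    // One row per source column: (column name, quantile 0, quantile 10).
    val quantileDf = cols.zip(quantiles)
      .map { case (name, qs) => (name, qs(0), qs(1)) }
      .toSeq
      .toDF("perctile_col", "quantile_0", "quantile_10")

    // Every remaining row is paired with every quantile row.
    df.drop(cols: _*).crossJoin(quantileDf)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to make the sketch self-contained.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("quantile-cross-join")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      (10, "1/15/2018", 0.010680705, 6, 0.019875458),
      (10, "1/15/2018", 0.006628853, 4, 0.816039063),
      (10, "1/15/2018", 0.01378215, 4, 0.082049528),
      (10, "1/15/2018", 0.010680705, 6, 0.019875458),
      (10, "1/15/2018", 0.006628853, 4, 0.816039063)
    ).toDF("id", "date", "revenue", "con_dist_1", "con_dist_2")

    attachQuantiles(df, Array("con_dist_1", "con_dist_2")).show()

    spark.stop()
  }
}
```

The lookup dataframe only has as many rows as there are columns in col_list, so the cross join stays small, and adding another percentile means adding one more field to the tuple and one more column name.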