Add Aggregate Column to Spark DataFrame
Problem Description
I have a Spark DataFrame that looks like:
| id | value | bin |
|----|-------|-----|
| 1  | 3.4   | 2   |
| 2  | 2.6   | 1   |
| 3  | 1.8   | 1   |
| 4  | 9.6   | 2   |
I have a function `f` that takes an array of values and returns a number. I want to add a column to the above data frame where the value of the new column in each row is the value of `f` applied to all the `value` entries that have the same `bin` entry, i.e.:
| id | value | bin | f_value       |
|----|-------|-----|---------------|
| 1  | 3.4   | 2   | f([3.4, 9.6]) |
| 2  | 2.6   | 1   | f([2.6, 1.8]) |
| 3  | 1.8   | 1   | f([2.6, 1.8]) |
| 4  | 9.6   | 2   | f([3.4, 9.6]) |
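The desired transformation can be sketched in plain Scala collections (no Spark) to make the semantics concrete; here `f` is shown as a simple sum, purely for illustration:

```scala
// Rows of the example DataFrame as (id, value, bin) tuples
val rows = Seq((1, 3.4, 2), (2, 2.6, 1), (3, 1.8, 1), (4, 9.6, 2))

// Hypothetical aggregate; substitute your own f
def f(values: Seq[Double]): Double = values.sum

// Collect all values per bin, then attach f(values of that bin) to each row
val valuesPerBin: Map[Int, Seq[Double]] =
  rows.groupBy(_._3).mapValues(_.map(_._2)).toMap

val withFValue = rows.map { case (id, value, bin) =>
  (id, value, bin, f(valuesPerBin(bin)))
}
// e.g. row (1, 3.4, 2) becomes (1, 3.4, 2, f(Seq(3.4, 9.6)))
```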
Since I need to aggregate all `value`s per `bin`, I cannot use the `withColumn` function to add this new column. What is the best way to do this until user-defined aggregation functions make their way into Spark?
Recommended Answer
The code below is untested; it is just an idea.
In Hive, this can be done using the `collect_list` function:
```scala
val newDF = sqlContext.sql(
  "select bin, collect_list(value) as values from aboveDF group by bin")
```
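For the SQL above to see `aboveDF`, the DataFrame first has to be registered as a temporary table; a sketch using the Spark 1.x API (which the `sqlContext` in the answer suggests):

```scala
// Register the DataFrame so SQL can reference it as "aboveDF"
aboveDF.registerTempTable("aboveDF")

// One row per bin with all of that bin's values collected into an array
val newDF = sqlContext.sql(
  "select bin, collect_list(value) as values from aboveDF group by bin")
```

On Spark 2.x+, `createOrReplaceTempView` replaces the deprecated `registerTempTable`.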
Next, join `aboveDF` and `newDF` on `bin`.
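Put together, the whole approach can be sketched with the DataFrame API instead of SQL (untested; assumes a modern `SparkSession`, and the `values` column name, the `fUdf` helper, and `f` as a sum are all illustrative choices, not part of the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, udf}

val spark = SparkSession.builder().master("local[*]").appName("agg-col").getOrCreate()
import spark.implicits._

val aboveDF = Seq((1, 3.4, 2), (2, 2.6, 1), (3, 1.8, 1), (4, 9.6, 2))
  .toDF("id", "value", "bin")

// Hypothetical f wrapped in a UDF; substitute your own aggregate
val fUdf = udf { (values: Seq[Double]) => values.sum }

// One row per bin with the collected values ...
val newDF = aboveDF.groupBy("bin").agg(collect_list("value").as("values"))

// ... joined back onto the original rows, then f applied per row
val result = aboveDF
  .join(newDF, "bin")
  .withColumn("f_value", fUdf($"values"))
  .drop("values")
```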
Is this what you are looking for?