Convert row values into columns with their values from another column in Spark Scala

Question
I'm trying to convert values from rows into different columns, with their values taken from another column. For example -
The input dataframe looks like this -
+-----------+
| X | Y | Z |
+-----------+
| 1 | A | a |
| 2 | A | b |
| 3 | A | c |
| 1 | B | d |
| 3 | B | e |
| 2 | C | f |
+-----------+
And the output dataframe should look like this -
+---+------+------+------+
| Y | 1    | 2    | 3    |
+---+------+------+------+
| A | a    | b    | c    |
| B | d    | null | e    |
| C | null | f    | null |
+---+------+------+------+
I've tried to groupBy the values based on Y with collect_list on X and Z, and then zipped X & Z together to get key-value pairs. But some Xs may be missing for some values of Y, so in order to fill them with null values, I cross joined all possible values of X with all possible values of Y and then joined the result with my original dataframe. This approach is highly inefficient.
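For reference, a minimal sketch of the cross-join fill step described above (the variable names are illustrative assumptions, not code from the question):

// Sketch of the inefficient approach: enumerate every (X, Y) combination,
// then left join the original data so missing pairs surface as nulls.
import org.apache.spark.sql.functions._

val allX = df.select("X").distinct()
val allY = df.select("Y").distinct()
val filled = allY.crossJoin(allX).join(df, Seq("X", "Y"), "left")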
Is there any efficient method to approach this problem? Thanks in advance.
Answer
You can simply use groupBy with pivot and first as the aggregate function:
import org.apache.spark.sql.functions._

// Pivot on X so each distinct X value becomes a column, taking the
// first (here, only) Z value for each (Y, X) pair.
df.groupBy("Y").pivot("X").agg(first("Z"))
Output:
+---+----+----+----+
|Y |1 |2 |3 |
+---+----+----+----+
|B |d |null|e |
|C |null|f |null|
|A |a |b |c |
+---+----+----+----+
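A self-contained sketch for trying this locally (the SparkSession setup and sample data are assumptions for testing, not part of the original answer):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("pivot-example").getOrCreate()
import spark.implicits._

// Sample data matching the question's input dataframe.
val df = Seq(
  (1, "A", "a"), (2, "A", "b"), (3, "A", "c"),
  (1, "B", "d"), (3, "B", "e"), (2, "C", "f")
).toDF("X", "Y", "Z")

// Listing the pivot values up front avoids the extra pass over the data
// that Spark otherwise needs to discover the distinct values of X.
df.groupBy("Y").pivot("X", Seq(1, 2, 3)).agg(first("Z")).show(false)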