How to unpivot a large spark dataframe?


Question


I have seen a few solutions to unpivot a spark dataframe when the number of columns is reasonably low and the columns' names can be hardcoded. Do you have a scalable solution to unpivot a dataframe with numerous columns?

Below is a toy problem.

Input:

val df = Seq(
  (1, 1, 1, 0),
  (2, 0, 0, 1)
).toDF("ID", "A", "B", "C")

+---+---+---+---+
| ID|  A|  B|  C|
+---+---+---+---+
|  1|  1|  1|  0|
|  2|  0|  0|  1|
+---+---+---+---+

Expected result:

+---+-----+-----+
| ID|names|count|
+---+-----+-----+
|  1|    A|    1|
|  1|    B|    1|
|  1|    C|    0|
|  2|    A|    0|
|  2|    B|    0|
|  2|    C|    1|
+---+-----+-----+


The solution should be applicable to datasets with N columns to unpivot, where N is large (say 100 columns).

Answer


This should work; I am assuming you know the list of columns that you want to unpivot on.

import org.apache.spark.sql.{functions => func, _}

val df = Seq(
  (1, 1, 1, 0),
  (2, 0, 0, 1)
).toDF("ID", "A", "B", "C")

// The columns to unpivot. This list can also be built programmatically,
// e.g. df.columns.filterNot(_ == "ID"), so the approach scales to many columns.
val cols = Seq("A", "B", "C")

df.select(
    $"ID",
    // Build one struct per column, holding the column name and its value,
    // then explode the array of structs into one row per (name, value) pair.
    func.explode(
      func.array(
        cols.map(col =>
          func.struct(
            func.lit(col).alias("names"),
            func.col(col).alias("count")
          )
        ): _*
      )
    ).alias("v")
  )
  .selectExpr("ID", "v.*")
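As a point of comparison (not part of the original answer), the same unpivot can be written with Spark SQL's built-in `stack` generator, which takes the number of rows to produce followed by alternating name/value expressions. A sketch, assuming the same `df` and `cols` as above:

```scala
// stack(3, 'A', A, 'B', B, 'C', C) emits one (names, count) row per column.
// Build the expression from the column list so it scales to many columns.
val stackExpr =
  s"stack(${cols.size}, " +
    cols.map(c => s"'$c', $c").mkString(", ") +
    ") as (names, count)"

val unpivoted = df.selectExpr("ID", stackExpr)
```

Note that `stack`, unlike the explode-of-structs approach, requires all value columns to share a common type, which holds here since A, B, and C are all integers.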

