Explode multiple columns in Spark SQL table


Problem description

There was a question regarding this issue here:

Explode (transpose?) multiple columns in Spark SQL table

Suppose that we have extra columns as below:

userId    someString      varA      varB        varC     varD
   1      "example1"    [0,2,5]   [1,2,9]     [a,b,c]  [red,green,yellow]
   2      "example2"    [1,20,5]  [9,null,6]  [d,e,f]  [white,black,cyan]

With the desired output:

userId    someString      varA     varB   varC     varD
   1      "example1"       0         1     a       red
   1      "example1"       2         2     b       green
   1      "example1"       5         9     c       yellow
   2      "example2"       1         9     d       white
   2      "example2"       20       null   e       black
   2      "example2"       5         6     f       cyan

The answer was to define a udf as:

val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))

and to use it with withColumn:

df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
   $"userId", $"someString",
   $"vars._1".alias("varA"), $"vars._2".alias("varB")).show

If we need to extend the above answer with more columns, what is the easiest way to amend the above code? Any help please.

Answer

The approach with the zip udf seems OK, but you need to extend it for more collections. Unfortunately there is no really nice way to zip 4 Seqs, but this should work:

// Fails fast if the arrays to be zipped have different lengths.
def assertSameSize(arrs: Seq[_]*) = {
  assert(arrs.map(_.size).distinct.size == 1, "sizes differ")
}

val zip4 = udf((xa: Seq[Long], xb: Seq[Long], xc: Seq[String], xd: Seq[String]) => {
  assertSameSize(xa, xb, xc, xd)
  // Zip by index into 4-tuples; one tuple per output row after explode.
  xa.indices.map(i => (xa(i), xb(i), xc(i), xd(i)))
})
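The zipping step itself is plain Scala, so it can be sanity-checked without a Spark session. A minimal sketch, where the helper name `zipSeqs` is hypothetical and simply mirrors the udf body above:

```scala
// Same index-based zip as the udf body, on plain collections (no Spark needed).
def zipSeqs(xa: Seq[Long], xb: Seq[Long],
            xc: Seq[String], xd: Seq[String]): Seq[(Long, Long, String, String)] = {
  require(Seq(xa, xb, xc, xd).map(_.size).distinct.size == 1, "sizes differ")
  xa.indices.map(i => (xa(i), xb(i), xc(i), xd(i)))
}

// Values taken from the first row of the sample table.
val rows = zipSeqs(Seq(0L, 2L, 5L), Seq(1L, 2L, 9L),
                   Seq("a", "b", "c"), Seq("red", "green", "yellow"))
// rows == Seq((0,1,"a","red"), (2,2,"b","green"), (5,9,"c","yellow"))
```

On the DataFrame side the application then mirrors the two-column version from the question: `df.withColumn("vars", explode(zip4($"varA", $"varB", $"varC", $"varD")))`, followed by selecting `$"vars._1"` through `$"vars._4"` with the original column names as aliases.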

