分解(转置?)Spark SQL表中的多个列 [英] Explode (transpose?) multiple columns in Spark SQL table

查看：226 发布时间：2020/9/3 23:28:37 sql apache-spark apache-spark-sql hiveql

本文介绍了分解(转置?)Spark SQL表中的多个列的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在使用Spark SQL(我提到它在Spark中，以防影响SQL语法-我尚不十分确定，尚不能确定)，并且我有一个表试图进行重组，但是我在尝试同时转置多列时陷入困境.

I am using Spark SQL (I mention that it is in Spark in case that affects the SQL syntax - I'm not familiar enough to be sure yet) and I have a table that I am trying to re-structure, but I'm getting stuck trying to transpose multiple columns at the same time.

基本上我的数据如下:

userId    someString      varA     varB
   1      "example1"    [0,2,5]   [1,2,9]
   2      "example2"    [1,20,5]  [9,null,6]

并且我想同时爆炸varA和varB(长度将始终保持一致)-这样最终输出将如下所示:

and I'd like to explode both varA and varB simultaneously (the length will always be consistent) - so that the final output looks like this:

userId    someString      varA     varB
   1      "example1"       0         1
   1      "example1"       2         2
   1      "example1"       5         9
   2      "example2"       1         9
   2      "example2"       20       null
   2      "example2"       5         6

但是我似乎只能在一个命令中使用单个explode(var)语句，如果我尝试将它们链接起来(即在第一个explode命令之后创建一个临时表)，那么我显然会得到很多重复的不必要行.

but I can only seem to get a single explode(var) statement to work in one command, and if I try to chain them (ie create a temp table after the first explode command) then I obviously get a huge number of duplicate, unnecessary rows.

非常感谢！

推荐答案

火花> = 2.4

您可以跳过zip udf并使用arrays_zip功能:

You can skip zip udf and use arrays_zip function:

df.withColumn("vars", explode(arrays_zip($"varA", $"varB"))).select(
  $"userId", $"someString",
  $"vars.varA", $"vars.varB").show

火花< 2.4

如果没有自定义UDF，就无法实现所需的内容.在Scala中，您可以执行以下操作:

What you want is not possible without a custom UDF. In Scala you could do something like this:

val data = sc.parallelize(Seq(
    """{"userId": 1, "someString": "example1",
        "varA": [0, 2, 5], "varB": [1, 2, 9]}""",
    """{"userId": 2, "someString": "example2",
        "varA": [1, 20, 5], "varB": [9, null, 6]}"""
))

val df = spark.read.json(data)

df.printSchema
// root
//  |-- someString: string (nullable = true)
//  |-- userId: long (nullable = true)
//  |-- varA: array (nullable = true)
//  |    |-- element: long (containsNull = true)
//  |-- varB: array (nullable = true)
//  |    |-- element: long (containsNull = true)

现在我们可以定义zip udf:

Now we can define zip udf:

import org.apache.spark.sql.functions.{udf, explode}

val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))

df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
   $"userId", $"someString",
   $"vars._1".alias("varA"), $"vars._2".alias("varB")).show

// +------+----------+----+----+
// |userId|someString|varA|varB|
// +------+----------+----+----+
// |     1|  example1|   0|   1|
// |     1|  example1|   2|   2|
// |     1|  example1|   5|   9|
// |     2|  example2|   1|   9|
// |     2|  example2|  20|null|
// |     2|  example2|   5|   6|
// +------+----------+----+----+

使用原始SQL:

sqlContext.udf.register("zip", (xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))
df.registerTempTable("df")

sqlContext.sql(
  """SELECT userId, someString, explode(zip(varA, varB)) AS vars FROM df""")

这篇关于分解(转置?)Spark SQL表中的多个列的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

分解(转置?)Spark SQL表中的多个列 [英] Explode (transpose?) multiple columns in Spark SQL table

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

分解(转置?)Spark SQL表中的多个列 [英] Explode (transpose?) multiple columns in Spark SQL table

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭