Spark higher-order function transform: output a struct
Question
How can I transform an array of structs into an array of structs again using Spark higher-order functions?
The dataset:
case class Foo(thing1:String, thing2:String, thing3:String)
case class Baz(foo:Foo, other:String)
case class Bar(id:Int, bazes:Seq[Baz])
import spark.implicits._
val df = Seq(Bar(1, Seq(Baz(Foo("first", "second", "third"), "other"), Baz(Foo("1", "2", "3"), "else")))).toDF
df.printSchema
df.show(false)
I want to concatenate thing1, thing2, and thing3 but keep the other property of each element in bazes.
A simple:
scala> df.withColumn("cleaned", expr("transform(bazes, x -> x)")).printSchema
root
|-- id: integer (nullable = false)
|-- bazes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)
|-- cleaned: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)
only copies the contents over.
The desired concat operation:
df.withColumn("cleaned", expr("transform(bazes, x -> concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3))")).printSchema
will, unfortunately, remove all the values from the other column:
+---+----------------------------------------------------+-------------------------------+
|id |bazes |cleaned |
+---+----------------------------------------------------+-------------------------------+
|1 |[[[first, second, third], other], [[1, 2, 3], else]]|[first::second::third, 1::2::3]|
+---+----------------------------------------------------+-------------------------------+
How can these be retained? Trying to keep them as tuples:
df.withColumn("cleaned", expr("transform(bazes, x -> (concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3), x.other))")).printSchema
fails with:
.AnalysisException: cannot resolve 'named_struct('col1', concat(namedlambdavariable().`foo`.`thing1`, '::', namedlambdavariable().`foo`.`thing2`, '::', namedlambdavariable().`foo`.`thing3`), NamePlaceholder(), namedlambdavariable().`other`)' due to data type mismatch: Only foldable string expressions are allowed to appear at odd position, got: NamePlaceholder; line 1 pos 22;
Edit
The desired output is a new column with the contents
[[first::second::third, other], [1::2::3, else]]
which retains the other column.
Recommended answer
You can achieve the desired output as shown here. You cannot reference the other value directly, because foo and other sit at the same level of the struct, so other has to be accessed separately (here by wrapping it in a cast) and combined with the concatenated value in a struct.
scala> df.withColumn("cleaned", expr("transform(bazes, x -> struct(concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),cast(x.other as string)))")).show(false)
+---+----------------------------------------------------+------------------------------------------------+
|id |bazes |cleaned |
+---+----------------------------------------------------+------------------------------------------------+
printSchema
scala> df.withColumn("cleaned", expr("transform(bazes, x -> struct(concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),cast(x.other as string)))")).printSchema
root
|-- id: integer (nullable = false)
|-- bazes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- foo: struct (nullable = true)
| | | |-- thing1: string (nullable = true)
| | | |-- thing2: string (nullable = true)
| | | |-- thing3: string (nullable = true)
| | |-- other: string (nullable = true)
|-- cleaned: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = true)
| | |-- col2: string (nullable = true)
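The schema above ends up with the generated field names col1 and col2. As a minimal sketch, assuming the SQL function named_struct is acceptable in place of struct plus cast, the fields can be given explicit names. The names things and other, the helper clean, and the object NamedStructSketch are illustrative choices, not part of the original answer, and the local SparkSession setup is only there to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Same case classes as in the question.
case class Foo(thing1: String, thing2: String, thing3: String)
case class Baz(foo: Foo, other: String)
case class Bar(id: Int, bazes: Seq[Baz])

object NamedStructSketch {
  // Hypothetical helper: adds a "cleaned" column whose elements are structs
  // with explicitly named fields. Because the names are string literals,
  // Spark does not have to infer a field name for x.other.
  def clean(df: DataFrame): DataFrame =
    df.withColumn("cleaned", expr(
      """transform(bazes, x -> named_struct(
        |  'things', concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),
        |  'other', x.other))""".stripMargin))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("named-struct-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Bar(1, Seq(
        Baz(Foo("first", "second", "third"), "other"),
        Baz(Foo("1", "2", "3"), "else")))
    ).toDF

    val cleaned = clean(df)
    cleaned.printSchema()
    cleaned.select("cleaned").show(false)

    spark.stop()
  }
}
```

In spark-shell, the body of clean can be pasted directly as a withColumn call on the existing df. Named fields make later access such as cleaned.things easier to read than cleaned.col1.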
Let me know if you have any further questions about this.