Exploding nested Struct in Spark dataframe
Question
I'm working through the Databricks example. The schema for the dataframe looks like:
> parquetDF.printSchema
root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: integer (nullable = true)
In the example, they show how to explode the employees column into 4 additional columns:
val explodeDF = parquetDF.explode($"employees") {
  case Row(employee: Seq[Row]) => employee.map{ employee =>
    val firstName = employee(0).asInstanceOf[String]
    val lastName = employee(1).asInstanceOf[String]
    val email = employee(2).asInstanceOf[String]
    val salary = employee(3).asInstanceOf[Int]
    Employee(firstName, lastName, email, salary)
  }
}.cache()
display(explodeDF)
How would I do something similar with the department column (i.e. add two additional columns to the dataframe called "id" and "name")? The methods aren't exactly the same, and I can only figure out how to create a brand new data frame using:
val explodeDF = parquetDF.select("department.id","department.name")
display(explodeDF)
If I try:
val explodeDF = parquetDF.explode($"department") {
  case Row(dept: Seq[String]) => dept.map{dept =>
    val id = dept(0)
    val name = dept(1)
  }
}.cache()
display(explodeDF)
I get a warning and an error:
<console>:38: warning: non-variable type argument String in type pattern Seq[String] is unchecked since it is eliminated by erasure
case Row(dept: Seq[String]) => dept.map{dept =>
^
<console>:37: error: inferred type arguments [Unit] do not conform to method explode's type parameter bounds [A <: Product]
val explodeDF = parquetDF.explode($"department") {
^
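A note on the error above, as a minimal sketch rather than Spark's actual code: the body of the `map` ends with a `val` definition, so the block evaluates to `Unit`, and `Unit` does not satisfy the `A <: Product` bound that the error message reports for `explode`. The `explodeLike` function below is a hypothetical stand-in that only reproduces that bound in plain Scala, and the `Department` case class and sample values are made up for illustration. Separately, a struct column is, as far as I know, passed in as a `Row` rather than a `Seq[String]`, so the pattern in the attempt would probably not match at run time either.

```scala
object ExplodeBoundSketch {
  // Hypothetical stand-in for the type bound reported in the error message:
  // the function's result type A must be a Product (a case class or a tuple).
  def explodeLike[A <: Product](rows: Seq[Seq[String]])(f: Seq[String] => A): Seq[A] =
    rows.map(f)

  // Hypothetical case class; case classes extend Product, so they satisfy the bound.
  case class Department(id: String, name: String)

  // Made-up sample input: one "row" holding an id and a name.
  val rows: Seq[Seq[String]] = Seq(Seq("d1", "Engineering"))

  // Does not compile: the block ends with a definition, so its type is Unit,
  // and Unit is not a subtype of Product.
  // val bad = explodeLike(rows) { dept => val id = dept(0); val name = dept(1) }

  // Compiles: the last expression of the block is a Product.
  val ok: Seq[Department] = explodeLike(rows) { dept => Department(dept(0), dept(1)) }
}
```

Returning a case class or a tuple as the last expression of the block is what removes the compile error.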
Recommended answer
You could use something like this:
var explodeDeptDF = explodeDF.withColumn("id", explodeDF("department.id"))
explodeDeptDF = explodeDeptDF.withColumn("name", explodeDeptDF("department.name"))
Your question helped me get there, along with these questions:
- Flattening Rows in Spark
- Spark 1.4.1 DataFrame explode list of JSON objects
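For reference, here is a self-contained sketch of the same idea that can be run outside a notebook. It assumes Spark 2.x or later (it uses `SparkSession`, which the question's environment may not have), and the case classes, the `FlattenDepartment` object, its method names, and the sample row are all hypothetical, chosen only to mirror the schema printed in the question. It shows the `withColumn` approach chained without a `var`, plus an alternative that expands every field of the struct with `department.*`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FlattenDepartment {
  // Hypothetical case classes mirroring the schema printed in the question.
  case class Department(id: String, name: String)
  case class Employee(firstName: String, lastName: String, email: String, salary: Int)
  case class DepartmentWithEmployees(department: Department, employees: Seq[Employee])

  // Made-up sample row; this is not data from the Databricks example.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      DepartmentWithEmployees(
        Department("d1", "Engineering"),
        Seq(Employee("Ada", "Lovelace", "ada@example.com", 100))
      )
    ).toDF()
  }

  // Adds top-level "id" and "name" columns copied from the department struct.
  def addDepartmentColumns(df: DataFrame): DataFrame =
    df.withColumn("id", col("department.id"))
      .withColumn("name", col("department.name"))

  // Alternative: keep every existing column and expand all fields of the struct.
  def expandDepartment(df: DataFrame): DataFrame =
    df.select(col("*"), col("department.*"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-department")
      .getOrCreate()
    val df = sampleData(spark)
    addDepartmentColumns(df).printSchema()
    expandDepartment(df).printSchema()
    spark.stop()
  }
}
```

Both variants leave the original `department` struct in place; dropping it afterwards is optional and depends on whether the nested form is still needed.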