How to modify a Spark Dataframe with a complex nested structure?


Question

I have a complex DataFrame structure and would like to null a column easily. I've created implicit classes that wire in functionality and easily address flat 2D DataFrame structures, but once the DataFrame becomes more complicated with ArrayType or MapType I've not had much luck. For example:

I have the schema defined as:

StructType(
    StructField(name,StringType,true), 
    StructField(data,ArrayType(
        StructType(
            StructField(name,StringType,true), 
            StructField(values,
                MapType(StringType,StringType,true),
            true)
        ),
        true
    ),
    true)
)

I'd like to produce a new DF that has the field data.value of MapType set to null, but as this is an element of an array I have not been able to figure out how. I would think it would be similar to:

df.withColumn("data.values", functions.array(functions.lit(null)))

but this ultimately creates a new column of data.values and does not modify the values element of the data array.

Answer

Since Spark 1.6, you can use case classes to map your dataframes (called datasets). Then, you can map your data and transform it to the new schema you want. For example:

case class Root(name: String, data: Seq[Data])
case class Data(name: String, values: Map[String, String])
case class NullableRoot(name: String, data: Seq[NullableData])
case class NullableData(name: String, value: Map[String, String], values: Map[String, String])

// requires the SparkSession's encoders in scope: import spark.implicits._
val nullableDF = df.as[Root].map { root =>
  val nullableData = root.data.map(data => NullableData(data.name, null, data.values))
  NullableRoot(root.name, nullableData)
}.toDF()
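The row-level logic of this answer can be checked without a Spark session at all, since the map runs over ordinary case classes. A minimal plain-Scala sketch (same case classes as above, with a hypothetical helper name `nullValues`):

```scala
// Same shapes as in the answer; no Spark dependency needed to test the mapping.
case class Root(name: String, data: Seq[Data])
case class Data(name: String, values: Map[String, String])
case class NullableRoot(name: String, data: Seq[NullableData])
case class NullableData(name: String, value: Map[String, String], values: Map[String, String])

// Per-row transformation: copy the structure, nulling the `value` field.
def nullValues(root: Root): NullableRoot =
  NullableRoot(root.name, root.data.map(d => NullableData(d.name, null, d.values)))

val row = Root("a", Seq(Data("d1", Map("k" -> "v"))))
val out = nullValues(row)
println(out.data.head.value)  // null
println(out.data.head.values) // Map(k -> v)
```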

The resulting schema of nullableDF will be:

root
 |-- name: string (nullable = true)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- values: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
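On Spark 3.1 and later (newer than this answer), the same result can be reached without case classes, using the `transform` higher-order function together with `Column.withField` to rewrite each array element in place. A sketch, assuming those APIs are available in your Spark version:

```scala
import org.apache.spark.sql.functions.{col, lit, transform}

// Rewrite each struct inside the `data` array, replacing its `values`
// field with a typed null; the rest of the schema is left untouched.
val nulled = df.withColumn(
  "data",
  transform(col("data"), element =>
    element.withField("values", lit(null).cast("map<string,string>")))
)
```

Unlike the Dataset/case-class route, this keeps the original column order and avoids defining a parallel set of case classes.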
