How to modify a Spark Dataframe with a complex nested structure?


Question

I have a complex DataFrame structure and would like to null out a column easily. I've created implicit classes that wire up functionality and easily handle 2D DataFrame structures, but once the DataFrame becomes more complicated with ArrayType or MapType I've not had much luck. For example:

My schema is defined as:

StructType(
    StructField(name,StringType,true), 
    StructField(data,ArrayType(
        StructType(
            StructField(name,StringType,true), 
            StructField(values,
                MapType(StringType,StringType,true),
            true)
        ),
        true
    ),
    true)
)

I'd like to produce a new DF that has the MapType field data.values set to null, but as this is an element of an array I have not been able to figure out how. I would have thought it would be something like:

df.withColumn("data.values", functions.array(functions.lit(null)))

but this ultimately creates a new top-level column literally named data.values and does not modify the values element of the data array.

Answer

Since Spark 1.6, you can map your DataFrame to case classes (this gives you a Dataset). You can then map over the data and transform it into the new schema you want. For example:

import spark.implicits._  // needed for the Dataset encoders used by .as[Root]

case class Root(name: String, data: Seq[Data])
case class Data(name: String, values: Map[String, String])
case class NullableRoot(name: String, data: Seq[NullableData])
case class NullableData(name: String, value: Map[String, String], values: Map[String, String])

val nullableDF = df.as[Root].map { root =>
  // Rebuild each element of the array, setting the new map field to null
  val nullableData = root.data.map(data => NullableData(data.name, null, data.values))
  NullableRoot(root.name, nullableData)
}.toDF()

The resulting schema of nullableDF is:

root
 |-- name: string (nullable = true)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- values: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
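On newer Spark versions there is also a DataFrame-only route that avoids case classes entirely. This is a sketch assuming Spark 2.4+ (for the transform higher-order function in SQL expressions); it rebuilds each struct in the array and replaces values with a typed null, keeping the original schema:

```scala
import org.apache.spark.sql.functions.expr

// Sketch: rewrite each struct in the `data` array, replacing `values`
// with a null cast to the matching map type. Assumes Spark 2.4+.
val nulledDF = df.withColumn(
  "data",
  expr("""transform(data, d -> named_struct(
            'name', d.name,
            'values', cast(null as map<string,string>)))""")
)
```

Because the null is cast to map&lt;string,string&gt;, the column keeps its original type rather than degrading to a plain null type.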

