How can I explode a struct in a dataframe without hard-coding the column names?


Question

Consider one of my data sets as an example. Below is the result of df.printSchema():

root
 |-- member: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |    |-- streetAddress: string (nullable = true)
 |    |    |-- zipCode: string (nullable = true)
 |    |-- birthDate: string (nullable = true)
 |    |-- groupIdentification: string (nullable = true)
 |    |-- memberCode: string (nullable = true)
 |    |-- patientName: struct (nullable = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- memberContractCode: string (nullable = true)
 |-- memberContractType: string (nullable = true)
 |-- memberProductCode: string (nullable = true)

This data is read in via JSON, and I want to flatten it out so that everything is on the same level and my dataframe contains only primitive types, like so:

member.address.city: string (nullable = true)
member.address.state: string (nullable = true)
member.address.streetAddress: string (nullable = true)
member.address.zipCode: string (nullable = true)
member.birthDate: string (nullable = true)
member.groupIdentification: string (nullable = true)
member.memberCode: string (nullable = true)...

I know this can be done by manually specifying the column names, like so:

df = df.withColumn("member.address.city", df("member.address.city")).withColumn("member.address.state", df("member.address.state"))...
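
For reference, the same hard-coded flattening can also be written as a single select with aliases; this is the shape that the answer below generalizes (a sketch, using column names from the example schema):

// hard-coded equivalent using select + alias (the aliases contain literal dots):
val flat = df.select(
  df("member.address.city").as("member.address.city"),
  df("member.address.state").as("member.address.state")
  // ... one entry per leaf column
)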

However, I won't be able to hard-code the column names like this for all of my data sets, as the program needs to be able to process new datasets on the fly without any changes to the actual code. I want to make a general method that can explode any kind of struct, given that it is already in a dataframe and its schema is known (but is a subset of the full schema). Is this possible in Spark 1.6? And if so, how?

Answer

This should do it - you'll need to iterate over the schema and "flatten" it, handling fields of type StructType separately from "simple" fields:

// assumes Spark 1.6 with an existing SQLContext instance named sqlContext:
import org.apache.spark.sql.types.{StructField, StructType}
import sqlContext.implicits._ // for the $"..." column syntax used below

// helper recursive method to "flatten" the schema into fully qualified leaf-field names:
def getFields(parent: String, schema: StructType): Seq[String] = schema.fields.flatMap {
  case StructField(name, t: StructType, _, _) => getFields(parent + name + ".", t) // recurse into nested structs
  case StructField(name, _, _, _) => Seq(s"$parent$name") // leaf field, e.g. "member.address.city"
}

// apply to our DF's schema:
val fields: Seq[String] = getFields("", df.schema)

// select these fields:
val result = df.select(fields.map(name => $"$name" as name): _*)
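
Applied to the schema above, this should yield a flat dataframe with columns named member.address.city, member.address.state, and so on. A side note: because the dots are now literal characters in the flattened column names, later references to those columns need backticks so Spark does not re-interpret the dots as struct nesting, e.g.:

// backticks escape the dots in the flattened column names:
result.select("`member.address.city`", "`member.birthDate`").show()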
