How can I explode a struct in a dataframe without hard-coding the column names?


Question

Consider one of my data sets as an example. Below is the result of df.printSchema():

member: struct (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- streetAddress: string (nullable = true)
 |    |-- zipCode: string (nullable = true)
 |-- birthDate: string (nullable = true)
 |-- groupIdentification: string (nullable = true)
 |-- memberCode: string (nullable = true)
 |-- patientName: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- last: string (nullable = true)
memberContractCode: string (nullable = true)
memberContractType: string (nullable = true)
memberProductCode: string (nullable = true)

This data is read in via JSON, and I want to flatten it out so that everything is on the same level and the dataframe contains only primitive types, like so:

member.address.city: string (nullable = true)
member.address.state: string (nullable = true)
member.address.streetAddress: string (nullable = true)
member.address.zipCode: string (nullable = true)
member.birthDate: string (nullable = true)
member.groupIdentification: string (nullable = true)
member.memberCode: string (nullable = true)...
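
For reference, a minimal sketch of how such a DataFrame might be loaded in Spark 1.6 (the file path is hypothetical; sqlContext is the usual SQLContext):

// read the nested JSON records into a DataFrame; Spark infers the nested schema
val df = sqlContext.read.json("path/to/members.json")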

I know this can be done by manually specifying the column names like so:

df = df.withColumn("member.address.city", df("member.address.city")).withColumn("member.address.state", df("member.address.state"))...
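
Note that withColumn with a dotted name creates a flat column whose name literally contains the dots, so any later reference to it needs backticks, as this small illustration shows:

// the flat column's name contains literal dots, so quote it with backticks
df.withColumn("member.address.city", df("member.address.city"))
  .select("`member.address.city`")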

However, I won't be able to hardcode the column names like above for all of my data sets, as the program needs to be able to process new datasets on the fly without any changes to the actual code. I want to make a general method that can explode any type of structure, given that it is already in a dataframe and the schema is known (but is a subset of the full schema). Is this possible in Spark 1.6? And if so, how?

Answer

This should do it - you'll need to iterate over the schema and "flatten" it, handling fields of type StructType separately from "simple" (primitive) fields:

// assumed imports (not shown in the original answer); in Spark 1.6 the
// $-column syntax comes from sqlContext.implicits:
import org.apache.spark.sql.types.{StructField, StructType}
import sqlContext.implicits._

// helper recursive method to "flatten" the schema: it walks the tree and
// collects the full dotted path of every leaf (non-struct) field
def getFields(parent: String, schema: StructType): Seq[String] = schema.fields.flatMap {
  case StructField(name, t: StructType, _, _) => getFields(parent + name + ".", t)
  case StructField(name, _, _, _)             => Seq(s"$parent$name")
}

// apply to our DF's schema:
val fields: Seq[String] = getFields("", df.schema)

// select these fields: $"$name" resolves each nested path, and `as name`
// keeps the dotted path as the name of the new flat column
val result = df.select(fields.map(name => $"$name" as name): _*)
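
Applied to the sample schema from the question, this is roughly what you should see (sketched output, not captured from a live session):

fields
// Seq("member.address.city", "member.address.state", "member.address.streetAddress",
//     "member.address.zipCode", "member.birthDate", "member.groupIdentification",
//     "member.memberCode", "member.patientName.first", "member.patientName.last",
//     "memberContractCode", "memberContractType", "memberProductCode")

result.printSchema()
// member.address.city: string (nullable = true)
// member.address.state: string (nullable = true)
// ... (every remaining column is a primitive type)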
