Flatten Spark Dataframe column of map/dictionary into multiple columns


Problem description


We have a DataFrame that looks like this:

DataFrame[event: string, properties: map<string,string>]

Notice that there are two columns: event and properties. How do we split or flatten the properties column into multiple columns based on the key values in the map?


I notice I can do something like this:

newDf = df.withColumn("foo", col("properties")["foo"])

which produces a DataFrame of

DataFrame[event: string, properties: map<string,string>, foo: String]

But then I would have to do this for every key, one by one. Is there a way to do them all automatically? For example, if the keys in properties are foo, bar, and baz, can we flatten the map into:

DataFrame[event: string, foo: String, bar: String, baz: String]
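
As a side note, the per-key approach above can at least be generalized with a list comprehension when the key names are known in advance. This is only a minimal sketch; the keys list here is an assumption, not something given in the original question:

from pyspark.sql.functions import col

# Hypothetical: assumes the map keys are known ahead of time
keys = ["foo", "bar", "baz"]
flattened = df.select("event", *[col("properties")[k].alias(k) for k in keys])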

Solution

You can use the explode() function - it flattens the map by creating two additional columns, key and value, for each entry:

>>> df.printSchema()
root
 |-- event: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

>>> df.select('event', explode('properties')).printSchema()
root
 |-- event: string (nullable = true)
 |-- key: string (nullable = false)
 |-- value: string (nullable = true)
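
A minimal, self-contained sketch that reproduces the schema above; the sample rows are assumed for illustration and are not from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the schema in the question
df = spark.createDataFrame(
    [("click", {"foo": "1", "bar": "2"}), ("view", {"foo": "3", "baz": "4"})],
    ["event", "properties"],
)

# Each map entry becomes its own row with `key` and `value` columns
df.select("event", explode("properties")).printSchema()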

You can use pivot if you have a column with unique values that you can group by. For example:

from pyspark.sql.functions import explode, first, monotonically_increasing_id

df.withColumn('id', monotonically_increasing_id()) \
    .select('id', 'event', explode('properties')) \
    .groupBy('id', 'event').pivot('key').agg(first('value'))
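
Putting it together on the hypothetical sample DataFrame sketched earlier, and dropping the helper id column afterwards (an illustrative sketch, not output from the original answer):

from pyspark.sql.functions import explode, first, monotonically_increasing_id

flat = (
    df.withColumn("id", monotonically_increasing_id())
      .select("id", "event", explode("properties"))
      .groupBy("id", "event")
      .pivot("key")
      .agg(first("value"))
      .drop("id")
)
flat.show()
# Yields one row per original record, with one column per map key;
# rows that lack a given key get null in that column.

Here monotonically_increasing_id() simply supplies a per-row grouping key so the pivot can reassemble each original record; any existing column with unique values would serve the same purpose.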
