Flatten Spark Dataframe column of map/dictionary into multiple columns
Question
We have a DataFrame
that looks like this:
DataFrame[event: string, properties: map<string,string>]
Notice that there are two columns: event and properties. How do we split or flatten the properties column into multiple columns based on the keys in the map?
I notice I can do something like this:
newDf = df.withColumn("foo", col("properties")["foo"])
which produces a DataFrame of
DataFrame[event: string, properties: map<string,string>, foo: String]
But then I would have to do this for all the keys one by one. Is there a way to do them all automatically? For example, if properties has foo, bar, and baz as its keys, can we flatten the map into:
DataFrame[event: string, foo: String, bar: String, baz: String]
You can use the explode() function: it flattens the map by creating two additional columns, key and value, for each entry:
>>> df.printSchema()
root
|-- event: string (nullable = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
>>> df.select('event', explode('properties')).printSchema()
root
|-- event: string (nullable = true)
|-- key: string (nullable = false)
|-- value: string (nullable = true)
You can use pivot if you have a column with unique values that you can group by. For example:
from pyspark.sql.functions import explode, first, monotonically_increasing_id

df.withColumn('id', monotonically_increasing_id()) \
  .select('id', 'event', explode('properties')) \
  .groupBy('id', 'event').pivot('key').agg(first('value'))