PySpark converting a column of type 'map' to multiple columns in a dataframe
Question
Input
I have a column Parameters of type map of the form:
>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
>>> df = sqlContext.createDataFrame(d)
>>> df.collect()
[Row(Parameters={'foo': '1', 'bar': '2', 'baz': 'aaa'})]
Output
I want to reshape it in pyspark so that all the keys (foo, bar, etc.) are columns, namely:
[Row(foo='1', bar='2', baz='aaa')]
Using withColumn works:
(df
.withColumn('foo', df.Parameters['foo'])
.withColumn('bar', df.Parameters['bar'])
.withColumn('baz', df.Parameters['baz'])
.drop('Parameters')
).collect()
But I need a solution that doesn't explicitly mention the column names, as I have dozens of them.
Schema
>>> df.printSchema()
root
|-- Parameters: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Solution
Since the keys of the MapType are not part of the schema, you'll have to collect them first, for example like this:
from pyspark.sql.functions import explode
# explode expands each map entry into a (key, value) row; distinct() keeps
# the unique key names, which will become the new column names.
keys = (df
    .select(explode("Parameters"))
    .select("key")
    .distinct()
    .rdd.flatMap(lambda x: x)
    .collect())
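As a side note, on Spark 2.3 and later you can get the same list with map_keys instead of exploding the whole map. This is a minimal sketch, assuming the same DataFrame and the Parameters column from the question:
from pyspark.sql.functions import explode, map_keys
# Sketch (assumes Spark >= 2.3): map_keys extracts each row's keys as an array,
# explode flattens those arrays into rows, distinct()/collect() gather the unique names.
keys = (df
    .select(explode(map_keys("Parameters")))
    .distinct()
    .rdd.flatMap(lambda x: x)
    .collect())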
Once you have these, all that is left is a simple select:
from pyspark.sql.functions import col
exprs = [col("Parameters").getItem(k).alias(k) for k in keys]
df.select(*exprs)
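For reference, here is a minimal end-to-end sketch using the sample data from the question. It assumes a running SparkSession named spark (with the SQLContext from the question, sqlContext.createDataFrame(d) works the same way); the order of the resulting columns depends on the order in which the keys come back from collect():
from pyspark.sql.functions import col, explode

d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
df = spark.createDataFrame(d)  # assumes an existing SparkSession named `spark`

# Step 1: collect the distinct map keys; these become the column names.
keys = (df
    .select(explode("Parameters"))
    .select("key")
    .distinct()
    .rdd.flatMap(lambda x: x)
    .collect())

# Step 2: build one column expression per key and select them.
exprs = [col("Parameters").getItem(k).alias(k) for k in keys]
df.select(*exprs).collect()
# e.g. [Row(bar='2', baz='aaa', foo='1')]  (column order may vary)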