PySpark converting a column of type 'map' to multiple columns in a dataframe
Question
Input
I have a column Parameters of type map of the form:
>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
>>> df = sqlContext.createDataFrame(d)
>>> df.collect()
[Row(Parameters={'foo': '1', 'bar': '2', 'baz': 'aaa'})]
Output
I want to reshape it in pyspark so that all the keys (foo, bar, etc.) are columns, namely:
[Row(foo='1', bar='2', baz='aaa')]
Using withColumn works:
(df
.withColumn('foo', df.Parameters['foo'])
.withColumn('bar', df.Parameters['bar'])
.withColumn('baz', df.Parameters['baz'])
.drop('Parameters')
).collect()
But I need a solution that doesn't explicitly mention the column names, as I have dozens of them.
Schema
>>> df.printSchema()
root
|-- Parameters: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Solution
Since the keys of the MapType are not part of the schema, you'll have to collect them first, for example like this:
from pyspark.sql.functions import explode
# explode expands each map entry into a (key, value) row; distinct() keeps
# the unique key names, which will become the new column names.
keys = (df
    .select(explode("Parameters"))
    .select("key")
    .distinct()
    .rdd.flatMap(lambda x: x)
    .collect())
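As a side note, on Spark 2.3 and later you can get the same list with map_keys instead of exploding the whole map. This is a minimal sketch, assuming the same DataFrame and the Parameters column from the question:
from pyspark.sql.functions import explode, map_keys
# Sketch (assumes Spark >= 2.3): map_keys extracts each row's keys as an array,
# explode flattens those arrays into rows, distinct()/collect() gather the unique names.
keys = (df
    .select(explode(map_keys("Parameters")))
    .distinct()
    .rdd.flatMap(lambda x: x)
    .collect())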
Once you have these, all that is left is a simple select:
from pyspark.sql.functions import col
exprs = [col("Parameters").getItem(k).alias(k) for k in keys]
df.select(*exprs)
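For reference, here is a minimal end-to-end sketch using the sample data from the question. It assumes a running SparkSession named spark (with the SQLContext from the question, sqlContext.createDataFrame(d) works the same way); the order of the resulting columns depends on the order in which the keys come back from collect():
from pyspark.sql.functions import col, explode

d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
df = spark.createDataFrame(d)  # assumes an existing SparkSession named `spark`

# Step 1: collect the distinct map keys; these become the column names.
keys = (df
    .select(explode("Parameters"))
    .select("key")
    .distinct()
    .rdd.flatMap(lambda x: x)
    .collect())

# Step 2: build one column expression per key and select them.
exprs = [col("Parameters").getItem(k).alias(k) for k in keys]
df.select(*exprs).collect()
# e.g. [Row(bar='2', baz='aaa', foo='1')]  (column order may vary)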