pyspark create dictionary from data in two columns
Problem description
I have a pyspark dataframe with two columns:
[Row(zip_code='58542', dma='MIN'),
Row(zip_code='58701', dma='MIN'),
Row(zip_code='57632', dma='MIN'),
Row(zip_code='58734', dma='MIN')]
How can I make a key:value pair out of the data inside the columns?
For example:
{
"58542":"MIN",
"58701":"MIN",
etc..
}
I would like to avoid using collect for performance reasons. I've tried a few things but can't seem to get just the values.
Recommended answer
As Ankin says, you can use a MapType for this:
import pyspark
from pyspark.sql import Row

# Start a SparkContext and wrap it in a SparkSession for the example.
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

# Build the two-column DataFrame from the question.
data = spark.createDataFrame([Row(zip_code='58542', dma='MIN'),
                              Row(zip_code='58701', dma='MIN'),
                              Row(zip_code='57632', dma='MIN'),
                              Row(zip_code='58734', dma='MIN')])
data.show()
Output:
+---+--------+
|dma|zip_code|
+---+--------+
|MIN|   58542|
|MIN|   58701|
|MIN|   57632|
|MIN|   58734|
+---+--------+
from pyspark.sql.functions import udf
from pyspark.sql import types as T

# UDF that returns a one-entry map {zip_code: dma} for each row.
@udf(T.MapType(T.StringType(), T.StringType()))
def create_struct(zip_code, dma):
    return {zip_code: dma}

# Add the map as a new column and serialize every row to JSON.
data.withColumn('struct', create_struct(data.zip_code, data.dma)).toJSON().collect()
Output:
['{"dma":"MIN","zip_code":"58542","struct":{"58542":"MIN"}}',
'{"dma":"MIN","zip_code":"58701","struct":{"58701":"MIN"}}',
'{"dma":"MIN","zip_code":"57632","struct":{"57632":"MIN"}}',
'{"dma":"MIN","zip_code":"58734","struct":{"58734":"MIN"}}']
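Each row above carries its own one-entry map, whereas the question asks for a single dictionary. As a rough sketch of one way to get there, the collected JSON strings could be merged on the driver with plain Python. The helper name merge_row_maps and its map_field parameter are hypothetical and not part of the original answer. Because this runs on already-collected data, it is probably only reasonable when the result fits comfortably in driver memory.

```python
import json

# JSON strings copied from the output shown above.
rows = [
    '{"dma":"MIN","zip_code":"58542","struct":{"58542":"MIN"}}',
    '{"dma":"MIN","zip_code":"58701","struct":{"58701":"MIN"}}',
    '{"dma":"MIN","zip_code":"57632","struct":{"57632":"MIN"}}',
    '{"dma":"MIN","zip_code":"58734","struct":{"58734":"MIN"}}',
]


def merge_row_maps(json_rows, map_field="struct"):
    """Hypothetical helper: merge the per-row maps into one dict."""
    merged = {}
    for raw in json_rows:
        # Each parsed row holds a one-entry dict under map_field.
        merged.update(json.loads(raw)[map_field])
    return merged


zip_to_dma = merge_row_maps(rows)
print(zip_to_dma)
```

If the same zip_code appeared in more than one row, the last row seen would win, since dict.update overwrites existing keys.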