PySpark create new column with mapping from a dict
Question
Using Spark 1.6, I have a Spark DataFrame column (named, let's say, col1) with values A, B, C, DS, DNS, E, F, G and H, and I want to create a new column (say col2) with the values from the dict below. How do I map this? (So, for instance, 'A' needs to be mapped to 'S', etc.)
dict = {'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S', 'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}
Answer
Inefficient solution with UDF (version independent):
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def translate(mapping):
    def translate_(col):
        return mapping.get(col)
    return udf(translate_, StringType())

df = sc.parallelize([('DS', ), ('G', ), ('INVALID', )]).toDF(['key'])

mapping = {
    'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S',
    'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}

df.withColumn("value", translate(mapping)("key")).show()
Result:
+-------+-----+
| key|value|
+-------+-----+
| DS| S|
| G| NS|
|INVALID| null|
+-------+-----+
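The UDF simply delegates to dict.get per row, which is why the key absent from the mapping ('INVALID') comes back as null: dict.get returns None for missing keys and Spark renders Python's None as null. A plain-Python sketch of that per-row behavior (no Spark needed):

```python
# Per-row logic of the UDF above: dict.get returns None for keys
# that are absent from the mapping, which Spark displays as null.
mapping = {
    'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S',
    'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}

keys = ['DS', 'G', 'INVALID']
values = [mapping.get(k) for k in keys]
print(values)  # ['S', 'NS', None]
```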
Much more efficient (Spark >= 2.0, Spark < 3.0) is to create a MapType literal:
from pyspark.sql.functions import col, create_map, lit
from itertools import chain
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])
df.withColumn("value", mapping_expr.getItem(col("key")))
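create_map expects a flat, alternating sequence of key/value literals, and chain(*mapping.items()) produces exactly that flattening. A quick standalone demo of the itertools part (Python dicts preserve insertion order in 3.7+):

```python
from itertools import chain

mapping = {'A': 'S', 'B': 'S', 'E': 'NS'}
# chain(*mapping.items()) flattens [('A', 'S'), ('B', 'S'), ('E', 'NS')]
# into the alternating key/value sequence that create_map expects.
flat = list(chain(*mapping.items()))
print(flat)  # ['A', 'S', 'B', 'S', 'E', 'NS']
```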
With the same result:
+-------+-----+
| key|value|
+-------+-----+
| DS| S|
| G| NS|
|INVALID| null|
+-------+-----+
but with a much more efficient execution plan:
== Physical Plan ==
*Project [key#15, keys: [B,DNS,DS,F,E,H,C,G,A], values: [S,S,S,NS,NS,NS,S,NS,S][key#15] AS value#53]
+- Scan ExistingRDD[key#15]
compared to the UDF version:
== Physical Plan ==
*Project [key#15, pythonUDF0#61 AS value#57]
+- BatchEvalPython [translate_(key#15)], [key#15, pythonUDF0#61]
+- Scan ExistingRDD[key#15]
In Spark >= 3.0, getItem should be replaced with __getitem__ ([]), i.e.:
df.withColumn("value", mapping_expr[col("key")]).show()
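The bracket form works because indexing with [] dispatches to an object's __getitem__ method, so mapping_expr[col("key")] and mapping_expr.getItem(col("key")) resolve to the same lookup. A minimal plain-Python illustration of that dispatch, using a hypothetical Expr class (not part of PySpark):

```python
# Hypothetical Expr class to illustrate that e[key] is syntactic
# sugar for e.__getitem__(key) -- the mechanism behind the change.
class Expr:
    def __getitem__(self, key):
        return f"getItem({key})"

e = Expr()
print(e['key'])              # getItem(key)
print(e['key'] == e.__getitem__('key'))  # True
```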