Pyspark: Replacing value in a column by searching a dictionary


Question

I'm a newbie in PySpark.

I have a Spark DataFrame df that has a column 'device_type'.

I want to replace every value that is "Tablet" or "Phone" with "Mobile", and replace "PC" with "Desktop".

In Python (with pandas) I can do the following:

# pandas: map values via the dict; unmatched values are left unchanged
deviceDict = {'Tablet': 'Mobile', 'Phone': 'Mobile', 'PC': 'Desktop'}
df['device_type'] = df['device_type'].replace(deviceDict, inplace=False)

How can I achieve this using PySpark? Thanks!

Answer

You can use na.replace:

df = spark.createDataFrame([
    ('Tablet', ), ('Phone', ), ('PC', ), ('Other', ), (None, )
], ["device_type"])

# When to_replace is a dict, the second argument is ignored (1 is just a placeholder)
df.na.replace(deviceDict, 1).show()

+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|      Other|
|       null|
+-----------+
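
If you only want the replacement applied to the device_type column (relevant once the DataFrame has more than one column), na.replace also takes a subset argument. A minimal sketch, assuming Spark 2.4+ where the value argument can be omitted when to_replace is a dict:

# Restrict the dict-based replacement to a single column
df.na.replace(deviceDict, subset=['device_type']).show()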

Or with a map literal:

from itertools import chain
from pyspark.sql.functions import create_map, lit

# Build a map literal column from the dict: Tablet -> Mobile, Phone -> Mobile, PC -> Desktop
mapping = create_map([lit(x) for x in chain(*deviceDict.items())])

df.select(mapping[df['device_type']].alias('device_type')).show()

+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|       null|
|       null|
+-----------+

Please note that the latter solution will convert values not present in the mapping to NULL. If this is not the desired behavior, you can add coalesce:

from pyspark.sql.functions import coalesce

# Fall back to the original value when the mapping has no entry for it
df.select(
    coalesce(mapping[df['device_type']], df['device_type']).alias('device_type')
).show()

+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|      Other|
|       null|
+-----------+
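
If you want to overwrite the existing column rather than select a new one (mirroring the pandas assignment in the question), a minimal sketch using withColumn with the same mapping and coalesce:

from pyspark.sql.functions import coalesce

# Write the mapped values back into device_type, keeping unmapped values as-is
df = df.withColumn(
    'device_type',
    coalesce(mapping[df['device_type']], df['device_type'])
)
df.show()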
