Pyspark: Replacing value in a column by searching a dictionary
Question
I'm a newbie in PySpark.
I have a Spark DataFrame df that has a column 'device_type'.
I want to replace every value that is "Tablet" or "Phone" with "Mobile", and replace "PC" with "Desktop".
In Python (with pandas) I can do the following:
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df['device_type'] = df['device_type'].replace(deviceDict,inplace=False)
How can I achieve this using PySpark? Thanks!
Recommended answer
You can use na.replace:
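df = spark.createDataFrame([
    ('Tablet', ), ('Phone', ), ('PC', ), ('Other', ), (None, )
], ["device_type"])
# deviceDict is the dictionary from the question. When to_replace is a dict,
# the second argument is ignored; the 1 here is only a placeholder.
df.na.replace(deviceDict, 1).show()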
+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|      Other|
|       null|
+-----------+
or a map literal:
from itertools import chain
from pyspark.sql.functions import create_map, lit
mapping = create_map([lit(x) for x in chain(*deviceDict.items())])
df.select(mapping[df['device_type']].alias('device_type')).show()
+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|       null|
|       null|
+-----------+
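To see what create_map receives in the snippet above, it may help to look at the flattened argument list on its own. The sketch below uses only the standard library; deviceDict is the dictionary from the question, and flat_pairs is a name made up for illustration. create_map takes alternating key and value columns, which is why the dictionary items are chained into a single sequence before each element is wrapped in lit.

```python
from itertools import chain

deviceDict = {'Tablet': 'Mobile', 'Phone': 'Mobile', 'PC': 'Desktop'}

# items() yields (key, value) tuples; chain(*...) flattens them into
# key1, value1, key2, value2, ... in dictionary order.
flat_pairs = list(chain(*deviceDict.items()))
print(flat_pairs)
# ['Tablet', 'Mobile', 'Phone', 'Mobile', 'PC', 'Desktop']
```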
Please note that the latter solution will convert values not present in the mapping to NULL. If this is not the desired behavior, you can add coalesce:
from pyspark.sql.functions import coalesce
df.select(
    coalesce(mapping[df['device_type']], df['device_type']).alias('device_type')
).show()
+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|      Other|
|       null|
+-----------+
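As a rough plain-Python analogy (not how Spark evaluates the expression), the difference between the bare map lookup and the coalesce version resembles dict.get without and with a default. The names values, lookup_only and with_fallback are made up for this sketch; the data matches the example DataFrame.

```python
deviceDict = {'Tablet': 'Mobile', 'Phone': 'Mobile', 'PC': 'Desktop'}
values = ['Tablet', 'Phone', 'PC', 'Other', None]

# Map lookup alone: values missing from the mapping come back as None,
# which corresponds to NULL in the Spark output.
lookup_only = [deviceDict.get(v) for v in values]

# Lookup with a fallback to the original value, analogous to
# coalesce(mapping[col], col).
with_fallback = [deviceDict.get(v, v) for v in values]

print(lookup_only)    # ['Mobile', 'Mobile', 'Desktop', None, None]
print(with_fallback)  # ['Mobile', 'Mobile', 'Desktop', 'Other', None]
```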