replace column values in spark dataframe based on dictionary similar to np.where
Question
My data frame looks like -
no city amount
1 Kenora 56%
2 Sudbury 23%
3 Kenora 71%
4 Sudbury 41%
5 Kenora 33%
6 Niagara 22%
7 Hamilton 88%
It consists of 92M records. I want my data frame to look like -
no city amount new_city
1 Kenora 56% X
2 Sudbury 23% Sudbury
3 Kenora 71% X
4 Sudbury 41% Sudbury
5 Kenora 33% X
6 Niagara 22% X
7 Hamilton 88% Hamilton
Using python I can manage it (using np.where) but I'm not getting any results in pyspark. Any help?
What I have done so far -
from itertools import chain
from pyspark.sql.functions import create_map, lit

#create dictionary
city_dict = {'Kenora': 'X', 'Niagara': 'X'}
mapping_expr = create_map([lit(x) for x in chain(*city_dict.items())])
#lookup and replace
df = df.withColumn('new_city', mapping_expr[df['city']])
#But it gives me wrong results.
df.groupBy('new_city').count().show()
new_city count
X 2
null 3
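As a sanity check of the mapping itself, here is a plain-Python sketch (no Spark required) of what `chain(*city_dict.items())` builds and how lookups behave for cities missing from the dict:

```python
from itertools import chain

city_dict = {'Kenora': 'X', 'Niagara': 'X'}

# chain(*city_dict.items()) flattens the dict into the alternating
# key/value sequence that create_map expects
flat = list(chain(*city_dict.items()))
print(flat)  # ['Kenora', 'X', 'Niagara', 'X']

# the resulting Spark map behaves like dict.get with no default:
# cities absent from the dict come back as null
print(city_dict.get('Sudbury'))  # None
```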
Why is it giving me null values?
Recommended answer
The problem is that mapping_expr will return null for any city that is not contained in city_dict. A quick fix is to use coalesce to return the city if the mapping_expr returns a null value:
from pyspark.sql.functions import coalesce
#lookup and replace
df1 = df.withColumn('new_city', coalesce(mapping_expr[df['city']], df['city']))
df1.show()
#+---+--------+------+--------+
#| no| city|amount|new_city|
#+---+--------+------+--------+
#| 1| Kenora| 56%| X|
#| 2| Sudbury| 23%| Sudbury|
#| 3| Kenora| 71%| X|
#| 4| Sudbury| 41%| Sudbury|
#| 5| Kenora| 33%| X|
#| 6| Niagara| 22%| X|
#| 7|Hamilton| 88%|Hamilton|
#+---+--------+------+--------+
df1.groupBy('new_city').count().show()
#+--------+-----+
#|new_city|count|
#+--------+-----+
#| X| 4|
#|Hamilton| 1|
#| Sudbury| 2|
#+--------+-----+
The above method will fail, however, if one of the replacement values is null.
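The failure mode can be illustrated with a plain-Python analogy (no Spark required): coalesce takes the first non-null value, so an intended null replacement is indistinguishable from a missing mapping, and the original city comes back instead:

```python
# Hypothetical dict where one replacement value is intentionally null
city_dict = {'Kenora': None, 'Niagara': 'X'}

def coalesce_like(city):
    # mimics coalesce(mapping_expr[city], city):
    # the first non-null value wins
    mapped = city_dict.get(city)  # None if unmapped OR mapped to None
    return mapped if mapped is not None else city

print(coalesce_like('Kenora'))   # 'Kenora' -- the intended null is silently lost
print(coalesce_like('Niagara'))  # 'X'
print(coalesce_like('Sudbury'))  # 'Sudbury' -- fallback still works
```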
In this case, an easier alternative may be to use pyspark.sql.DataFrame.replace():
First use withColumn to create new_city as a copy of the values from the city column.
# wrap the dict views in list() so replace() receives plain lists
df.withColumn("new_city", df["city"])\
    .replace(to_replace=list(city_dict.keys()), value=list(city_dict.values()), subset="new_city")\
.groupBy('new_city').count().show()
#+--------+-----+
#|new_city|count|
#+--------+-----+
#| X| 4|
#|Hamilton| 1|
#| Sudbury| 2|
#+--------+-----+
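For comparison with the np.where approach the question alludes to, one plausible pandas/NumPy version (column names taken from the sample data) might look like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'no': [1, 2, 3, 4, 5, 6, 7],
    'city': ['Kenora', 'Sudbury', 'Kenora', 'Sudbury',
             'Kenora', 'Niagara', 'Hamilton'],
    'amount': ['56%', '23%', '71%', '41%', '33%', '22%', '88%'],
})

# np.where(condition, value_if_true, value_if_false)
df['new_city'] = np.where(df['city'].isin(['Kenora', 'Niagara']),
                          'X', df['city'])
print(df['new_city'].tolist())
# ['X', 'Sudbury', 'X', 'Sudbury', 'X', 'X', 'Hamilton']
```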