pyspark replace multiple values with null in dataframe
Question
I have a dataframe (df), and within it a column user_id:
df = sc.parallelize([(1, "not_set"),
                     (2, "user_001"),
                     (3, "user_002"),
                     (4, "n/a"),
                     (5, "N/A"),
                     (6, "userid_not_set"),
                     (7, "user_003"),
                     (8, "user_004")]).toDF(["key", "user_id"])
df:
+---+--------------+
|key| user_id|
+---+--------------+
| 1| not_set|
|  2|      user_001|
|  3|      user_002|
| 4| n/a|
| 5| N/A|
| 6|userid_not_set|
| 7| user_003|
| 8| user_004|
+---+--------------+
I would like to replace the following values with null: not_set, n/a, N/A and userid_not_set.
It would be good if I could keep the values in a list, so that new ones can be added and changed as well.
I am currently using a CASE statement within spark.sql to perform this, and would like to change it to pyspark.
Answer
None inside the when() function corresponds to null. If you want to fill in something other than null, supply that value in its place.
from pyspark.sql.functions import col, when

df = df.withColumn(
    "user_id",
    when(
        col("user_id").isin('not_set', 'n/a', 'N/A', 'userid_not_set'),
        None
    ).otherwise(col("user_id"))
)
df.show()
+---+--------+
|key| user_id|
+---+--------+
| 1| null|
| 2|user_001|
| 3|user_002|
| 4| null|
| 5| null|
| 6| null|
| 7|user_003|
| 8|user_004|
+---+--------+