pyspark replace multiple values with null in dataframe


Question

I have a dataframe (df) and within the dataframe I have a column user_id.

df = sc.parallelize([(1, "not_set"),
                     (2, "user_001"),
                     (3, "user_002"),
                     (4, "n/a"),
                     (5, "N/A"),
                     (6, "userid_not_set"),
                     (7, "user_003"),
                     (8, "user_004")]).toDF(["key", "user_id"])

df:

+---+--------------+
|key|       user_id|
+---+--------------+
|  1|       not_set|
|  2|      user_001|
|  3|      user_002|
|  4|           n/a|
|  5|           N/A|
|  6|userid_not_set|
|  7|      user_003|
|  8|      user_004|
+---+--------------+

I would like to replace the following values with null: not_set, n/a, N/A and userid_not_set.

It would be good if I could add any new values to a list and they too could be changed.

I am currently using a CASE statement within spark.sql to perform this and would like to change it to pyspark.

Answer

None inside the when() function corresponds to null. In case you wish to fill in anything else instead of null, you have to supply it in its place.

from pyspark.sql.functions import col, when

df = df.withColumn(
    "user_id",
    when(
        col("user_id").isin('not_set', 'n/a', 'N/A', 'userid_not_set'),
        None
    ).otherwise(col("user_id"))
)
df.show()
+---+--------+
|key| user_id|
+---+--------+
|  1|    null|
|  2|user_001|
|  3|user_002|
|  4|    null|
|  5|    null|
|  6|    null|
|  7|user_003|
|  8|user_004|
+---+--------+
