Include null values in collect_list in pyspark
Problem description
I am trying to include null values in collect_list while using pyspark; however, the collect_list operation excludes nulls. I have looked into the post "Pypsark - Retain null values when using collect_list", but the answer given there is not what I am looking for.
I have a dataframe df like this:
| id | family | date       |
-----------------------------
| 1  | Prod   | null       |
| 2  | Dev    | 2019-02-02 |
| 3  | Prod   | 2017-03-08 |
This is my code so far:
from pyspark.sql import functions as f
df.groupby("family").agg(f.collect_list("date").alias("entry_date"))
This gives me the following output:
| family | entry_date   |
-------------------------
| Prod   | [2017-03-08] |
| Dev    | [2019-02-02] |
What I actually want is:
| family | entry_date         |
-------------------------------
| Prod   | [null, 2017-03-08] |
| Dev    | [2019-02-02]       |
Can someone please help me with this? Thank you!
Recommended answer
A possible workaround is to replace all null values with a placeholder value before aggregating. (Perhaps not the best way to do this, but it is a solution nonetheless.)
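# Replace nulls in "date" with the placeholder "my_null".
# fill() with a string value only affects string-typed columns.
df = df.na.fill("my_null", subset=["date"])
df = df.groupby("family").agg(f.collect_list("date").alias("entry_date"))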
This should give you:
| family | entry_date            |
----------------------------------
| Prod   | [my_null, 2017-03-08] |
| Dev    | [2019-02-02]          |