pyspark: Auto filling in implicit missing values

Question

I have a data frame:

user  day  amount
a     2    10
a     1    14
a     4    5
b     1    4

You can see that the maximum value of day is 4 and the minimum is 1. I want to fill 0 into the amount column for all missing days of all users, so the above data frame becomes:

user day amount
    a 2 10
    a 1 14
    a 4 5
    a 3 0
    b 1 4
    b 2 0
    b 3 0
    b 4 0

How could I do that in PySpark? Many thanks.

Answer

Here is one approach. You can get the min and max values of day first, then group on the user column and pivot on day, add any day columns that are missing, fill all nulls with 0, and finally stack the columns back into rows.
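
For reference, a minimal setup sketch for the example above (the SparkSession creation and the df construction are assumptions reconstructed from the question's sample data; the snippets below assume F is pyspark.sql.functions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# the question's sample data
df = spark.createDataFrame(
    [("a", 2, 10), ("a", 1, 14), ("a", 4, 5), ("b", 1, 4)],
    ["user", "day", "amount"],
)

With df in place, the steps look like this: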

# min and max of day, e.g. Row(min(day)=1, max(day)=4)
min_max = df.agg(F.min("day"), F.max("day")).collect()[0]
# one column per observed day, holding the first amount; nulls become 0
df1 = df.groupBy("user").pivot("day").agg(F.first("amount").alias("amount")).na.fill(0)

# add a zero-filled column for every day in [min, max] that the pivot did not produce
missing_cols = [F.lit(0).alias(str(i)) for i in range(min_max[0], min_max[1] + 1)
                if str(i) not in df1.columns]
df1 = df1.select("*", *missing_cols)

#+----+---+---+---+---+
#|user|  1|  2|  4|  3|
#+----+---+---+---+---+
#|   b|  4|  0|  0|  0|
#|   a| 14| 10|  5|  0|
#+----+---+---+---+---+

#the next step is inspired from https://stackoverflow.com/a/37865645/9840637
# pair each day column name with its value, then explode the pairs back into rows
arr = F.explode(F.array([F.struct(F.lit(c).alias("day"), F.col(c).alias("amount"))
                         for c in df1.columns[1:]])).alias("kvs")
(df1.select(["user"] + [arr])
    .select(["user"] + ["kvs.day", "kvs.amount"]).orderBy("user")).show()

+----+---+------+
|user|day|amount|
+----+---+------+
|   a|  1|    14|
|   a|  2|    10|
|   a|  4|     5|
|   a|  3|     0|
|   b|  1|     4|
|   b|  2|     0|
|   b|  4|     0|
|   b|  3|     0|
+----+---+------+

Note: since the day column was pivoted, the dtypes might have changed, so you may have to cast them back to the original dtypes.
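
A minimal sketch of that cast-back (assigning the stacked result to a variable first; the integer types here are assumptions, so adjust them to your original schema):

result = (df1.select(["user"] + [arr])
             .select("user", "kvs.day", "kvs.amount")
             .orderBy("user"))

# day comes from the pivoted column names, so it is a string at this point
result = (result.withColumn("day", F.col("day").cast("int"))
                .withColumn("amount", F.col("amount").cast("int")))
result.printSchema()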
