Pyspark pivot data frame based on condition


Problem Description

I have a data frame in pyspark like below.

df.show()

+---+-------+----+
| id|   type|s_id|
+---+-------+----+
|  1|    ios|  11|
|  1|    ios|  12|
|  1|    ios|  13|
|  1|    ios|  14|
|  1|android|  15|
|  1|android|  16|
|  1|android|  17|
|  2|    ios|  21|
|  2|android|  18|
+---+-------+----+
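
For reference, a minimal sketch that reproduces this sample frame (assuming an existing SparkSession named spark):

df = spark.createDataFrame(
    [(1, "ios", 11), (1, "ios", 12), (1, "ios", 13), (1, "ios", 14),
     (1, "android", 15), (1, "android", 16), (1, "android", 17),
     (2, "ios", 21), (2, "android", 18)],
    ["id", "type", "s_id"],
)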

Now from this data frame I want to create another data frame by pivoting it.

df1.show()
+---+-----+-----+-----+---------+---------+---------+
| id| ios1| ios2| ios3| android1| android2| android3|
+---+-----+-----+-----+---------+---------+---------+
|  1|   11|   12|   13|       15|       16|       17|
|  2|   21| Null| Null|       18|     Null|     Null|
+---+-----+-----+-----+---------+---------+---------+

Here I need to consider a condition: for each id, even though there may be more than 3 rows of a given type, I want to consider only 3 or fewer.

How can I do it?

EDIT

new_df.show()

+---+-------+----+
| id|   type|s_id|
+---+-------+----+
|  1|    ios|  11|
|  1|    ios|  12|
|  1|       |  13|
|  1|       |  14|
|  1|andriod|  15|
|  1|       |  16|
|  1|       |  17|
|  2|andriod|  18|
|  2|    ios|  21|
+---+-------+----+

The result I am getting is:

+---+----+----+----+--------+----+----+
| id|   1|   2|   3|andriod1|ios1|ios2|
+---+----+----+----+--------+----+----+
|  1|  13|  14|  16|      15|  11|  12|
|  2|null|null|null|      18|  21|null|
+---+----+----+----+--------+----+----+

What I want is:

+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1  |15      |    null|    null|  11|  12|null|
|2  |18      |    null|    null|  21|null|null|
+---+--------+--------+--------+----+----+----+

Answer

Using the following logic should get you your desired result.

A window function is used to generate a row number for each group of id and type, ordered by s_id. The generated row number is used to filter to at most 3 rows per group and is concatenated with type. Finally, grouping and pivoting should give you your desired output.

from pyspark.sql import Window
from pyspark.sql import functions as f

# Rank rows within each (id, type) group by s_id
windowSpec = Window.partitionBy("id", "type").orderBy("s_id")

# Keep at most 3 rows per group, append each row's rank to its type
# (ios -> ios1, ios2, ios3), then pivot on the numbered type
df.withColumn("ranks", f.row_number().over(windowSpec))\
    .filter(f.col("ranks") < 4)\
    .withColumn("type", f.concat(f.col("type"), f.col("ranks")))\
    .drop("ranks")\
    .groupBy("id")\
    .pivot("type")\
    .agg(f.first("s_id"))\
    .show(truncate=False)

which should give you

+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1  |15      |16      |17      |11  |12  |13  |
|2  |18      |null    |null    |21  |null|null|
+---+--------+--------+--------+----+----+----+

Answer for the edited part

You just need an additional filter:

# Same pipeline as above, with an extra filter that drops rows
# whose type is empty
df.withColumn("ranks", f.row_number().over(windowSpec)) \
    .filter(f.col("ranks") < 4) \
    .filter(f.col("type") != "") \
    .withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
    .drop("ranks") \
    .groupBy("id") \
    .pivot("type") \
    .agg(f.first("s_id")) \
    .show(truncate=False)

which should give you

+---+--------+----+----+
|id |andriod1|ios1|ios2|
+---+--------+----+----+
|1  |15      |11  |12  |
|2  |18      |21  |null|
+---+--------+----+----+

Now this dataframe lacks the android2, android3 and ios3 columns, because they are not present in your updated input data. You can add them using the withColumn API and populate them with null values.
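
For example, a minimal sketch of that step (pivoted is a hypothetical variable holding the result of the snippet above, and the column names assume type is spelled android consistently, as in the desired output):

from pyspark.sql import functions as f

# Columns the final frame should contain, per the desired output above
expected_cols = ["android1", "android2", "android3", "ios1", "ios2", "ios3"]

# Add each column the pivot did not produce as an all-null column;
# the cast keeps its type consistent with s_id (a long here)
for c in expected_cols:
    if c not in pivoted.columns:
        pivoted = pivoted.withColumn(c, f.lit(None).cast("long"))

pivoted.select("id", *expected_cols).show(truncate=False)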
