Pyspark pivot data frame based on condition
Question
I have a data frame in pyspark like below.
df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| ios| 13|
| 1| ios| 14|
| 1|android| 15|
| 1|android| 16|
| 1|android| 17|
| 2| ios| 21|
| 2|android| 18|
+---+-------+----+
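For reference, a minimal sketch to reproduce this input frame (assuming an active SparkSession; the variable name spark is an assumption here, not from the original post):

from pyspark.sql import SparkSession

# Assumed setup: any active SparkSession will do
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "ios", 11), (1, "ios", 12), (1, "ios", 13), (1, "ios", 14),
     (1, "android", 15), (1, "android", 16), (1, "android", 17),
     (2, "ios", 21), (2, "android", 18)],
    ["id", "type", "s_id"],
)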
Now from this data frame I want to create another data frame by pivoting it.
df1.show()
+---+-----+-----+-----+---------+---------+---------+
| id| ios1| ios2| ios3| android1| android2| android3|
+---+-----+-----+-----+---------+---------+---------+
| 1| 11| 12| 13| 15| 16| 17|
| 2| 21| Null| Null| 18| Null| Null|
+---+-----+-----+-----+---------+---------+---------+
Here I need to apply a condition: for each id and type, even if there are more than 3 rows, I want to keep at most 3.
How can I do that?
EDIT
new_df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| | 13|
| 1| | 14|
| 1|andriod| 15|
| 1| | 16|
| 1| | 17|
| 2|andriod| 18|
| 2| ios| 21|
+---+-------+----+
The result I am getting is below:
+---+----+----+----+--------+----+----+
| id| 1| 2| 3|andriod1|ios1|ios2|
+---+----+----+----+--------+----+----+
| 1| 13| 14| 16| 15| 11| 12|
| 2|null|null|null| 18| 21|null|
+---+----+----+----+--------+----+----+
What I want is:
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 | null| null| 11| 12|null|
|2 |18 | null| null| 21|null|null|
+---+--------+--------+--------+----+----+----+
Answer
Using the following logic should get you your desired result. A Window function is used to generate a row number for each group of id and type, ordered by s_id. The generated row number is used to filter (keeping at most three rows per group) and is concatenated with type. Finally, grouping and pivoting should give you the desired output:
from pyspark.sql import Window
from pyspark.sql import functions as f

# Number rows within each (id, type) group, ordered by s_id
windowSpec = Window.partitionBy("id", "type").orderBy("s_id")

(df.withColumn("ranks", f.row_number().over(windowSpec))
   .filter(f.col("ranks") < 4)  # keep at most 3 rows per (id, type)
   .withColumn("type", f.concat(f.col("type"), f.col("ranks")))  # ios -> ios1, ios2, ...
   .drop("ranks")
   .groupBy("id")
   .pivot("type")  # one output column per type+rank value
   .agg(f.first("s_id"))
   .show(truncate=False))
which should give you
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 |16 |17 |11 |12 |13 |
|2 |18 |null |null |21 |null|null|
+---+--------+--------+--------+----+----+----+
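As a side note (my own addition, not part of the original answer): GroupedData.pivot also accepts an explicit list of values. Passing the expected column names up front skips the extra job Spark runs to collect the distinct pivot values, and it guarantees every listed column appears in the output even when a combination is absent from the data:

# Hedged variant: assumes the six expected column names are known up front
expected = ["android1", "android2", "android3", "ios1", "ios2", "ios3"]

(df.withColumn("ranks", f.row_number().over(windowSpec))
   .filter(f.col("ranks") < 4)
   .withColumn("type", f.concat(f.col("type"), f.col("ranks")))
   .drop("ranks")
   .groupBy("id")
   .pivot("type", expected)  # explicit values: no distinct scan, stable schema
   .agg(f.first("s_id"))
   .show(truncate=False))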
Answer for the edited part
You would just need an additional filter:
(df.withColumn("ranks", f.row_number().over(windowSpec))
   .filter(f.col("ranks") < 4)
   .filter(f.col("type") != "")  # the extra filter: drop rows with an empty type
   .withColumn("type", f.concat(f.col("type"), f.col("ranks")))
   .drop("ranks")
   .groupBy("id")
   .pivot("type")
   .agg(f.first("s_id"))
   .show(truncate=False))
which would give you
+---+--------+----+----+
|id |andriod1|ios1|ios2|
+---+--------+----+----+
|1 |15 |11 |12 |
|2 |18 |21 |null|
+---+--------+----+----+
Now this dataframe lacks the android2, android3 and ios3 columns, because they are not present in your updated input data. You can add them using the withColumn API and populate them with null values.
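A minimal sketch of that fill-in step, assuming the pivoted frame above is bound to a variable named pivoted (a hypothetical name), that s_id is a long, and using the column spellings exactly as they appear in the edited data:

from pyspark.sql import functions as f

# Hypothetical: pivoted = (...).agg(f.first("s_id")) from the snippet above
expected = ["andriod1", "andriod2", "andriod3", "ios1", "ios2", "ios3"]

for c in expected:
    if c not in pivoted.columns:
        # add any missing column, filled with nulls and typed like s_id
        pivoted = pivoted.withColumn(c, f.lit(None).cast("long"))

pivoted.select("id", *expected).show(truncate=False)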