Create a group id over a window in Spark Dataframe
Question
I have a dataframe where I want to assign an id within each Window partition. For example I have
id | col |
1 | a |
2 | a |
3 | b |
4 | c |
5 | c |
So I want (based on grouping by column col)
id | group |
1 | 1 |
2 | 1 |
3 | 2 |
4 | 3 |
5 | 3 |
I want to use a window function, but I cannot find any way to assign an id to each window. I need something like:
w = Window().partitionBy('col')
df = df.withColumn("group", id().over(w))
Is there any way to achieve something like that? (I cannot simply use col as the group id because I am interested in creating a window over multiple columns.)
Answer
Simply using the dense_rank built-in function over a Window should give you your desired result:
from pyspark.sql import Window as W
import pyspark.sql.functions as f

# dense_rank assigns equal ranks to rows with the same col value,
# with no gaps in the ranking, so it serves directly as a group id
df.select('id', f.dense_rank().over(W.orderBy('col')).alias('group')).show(truncate=False)
which should give you
+---+-----+
|id |group|
+---+-----+
|1 |1 |
|2 |1 |
|3 |2 |
|4 |3 |
|5 |3 |
+---+-----+