Creating indices for each group in Spark dataframe
Question
I have a dataframe in Spark with 2 columns, group_id and value, where value is a double. I would like to group the data based on group_id, order each group by value, and then add a third column index that represents the position of value in the ordering of values for the group.
For example, consider the following input data:
+--------+-----+
|group_id|value|
+--------+-----+
|1 |1.3 |
|2 |0.8 |
|1 |3.4 |
|1 |-1.7 |
|2 |2.3 |
|2 |5.9 |
|1 |2.7 |
|1 |0.0 |
+--------+-----+
The output would be something like
+--------+-----+-----+
|group_id|value|index|
+--------+-----+-----+
|1 |-1.7 |1 |
|1 |0.0 |2 |
|1 |1.3 |3 |
|1 |2.7 |4 |
|1 |3.4 |5 |
|2 |0.8 |1 |
|2 |2.3 |2 |
|2 |5.9 |3 |
+--------+-----+-----+
It is unimportant whether the index is 0-based or 1-based, and whether the sort is ascending or descending.
As a follow-up, consider the case where there is a third column, extra, in the original data that takes on multiple values for some (group_id, value) combinations. An example is:
+--------+-----+-----+
|group_id|value|extra|
+--------+-----+-----+
|1 |1.3 |1 |
|1 |1.3 |2 |
|2 |0.8 |1 |
|1 |3.4 |1 |
|1 |3.4 |2 |
|1 |3.4 |3 |
|1 |-1.7 |1 |
|2 |2.3 |1 |
|2 |5.9 |1 |
|1 |2.7 |1 |
|1 |0.0 |1 |
+--------+-----+-----+
Is there a way to add an index column such that the extra column is not considered but still kept? The output in this case would be
+--------+-----+-----+-----+
|group_id|value|extra|index|
+--------+-----+-----+-----+
|1 |-1.7 |1 |1 |
|1 |0.0 |1 |2 |
|1 |1.3 |1 |3 |
|1 |1.3 |2 |3 |
|1 |2.7 |1 |4 |
|1 |3.4 |1 |5 |
|1 |3.4 |2 |5 |
|1 |3.4 |3 |5 |
|2 |0.8 |1 |1 |
|2 |2.3 |1 |2 |
|2 |5.9 |1 |3 |
+--------+-----+-----+-----+
I know that it is possible to do this by:
- Duplicating the data
- Dropping the extra column
- Performing a distinct operation, which would result in the data from the original example
- Computing the index column using the original solution
- Joining the result with the data from the second example
However, this would involve a lot of extra computation and overhead.
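For reference, a minimal sketch of that workaround in PySpark, assuming df is the DataFrame from the second example (the variable names are illustrative):

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

# Drop extra and deduplicate to recover the data from the original example
deduped = df.drop('extra').distinct()

# Compute the index column with the original window-based solution
window = Window.partitionBy('group_id').orderBy('value')
indexed = deduped.withColumn('index', rank().over(window))

# Join the index back onto the full data, keeping the extra column
result = df.join(indexed, on=['group_id', 'value'])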
Answer
You can use Window functions to create a rank column based on value, partitioned by group_id:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank

# df is assumed to be the example DataFrame with columns group_id and value
# Define the window: partition by group_id, order by value within each group
window = Window.partitionBy(df['group_id']).orderBy(df['value'])

# Create the index column as the rank of each row's value within its group
df.select('*', rank().over(window).alias('index')).show()
+--------+-----+-----+
|group_id|value|index|
+--------+-----+-----+
| 1| -1.7| 1|
| 1| 0.0| 2|
| 1| 1.3| 3|
| 1| 2.7| 4|
| 1| 3.4| 5|
| 2| 0.8| 1|
| 2| 2.3| 2|
| 2| 5.9| 3|
+--------+-----+-----+
Because you first select '*', you keep all other columns with the above code as well. However, your second example shows that you are looking for the function dense_rank(), which gives a rank column with no gaps:
df.select('*', dense_rank().over(window).alias('index'))
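Applied to the data from the second example, rows that tie on value (such as the two rows with value 1.3 in group 1) share an index, and the next distinct value continues with no gap, matching the expected output above. A minimal sketch, assuming df2 holds that second DataFrame (the name is illustrative):

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# Rebuild the window from column names so it binds to df2
window2 = Window.partitionBy('group_id').orderBy('value')

# extra is not part of the window, so it does not affect the index,
# but select('*') keeps it in the output
df2.select('*', dense_rank().over(window2).alias('index')).show()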