Creating indices for each group in Spark dataframe
Question
I have a dataframe in Spark with 2 columns, group_id and value, where value is a double. I would like to group the data based on group_id, order each group by value, and then add a third column index that represents the position of value in the ordering of values for the group.
For example, consider the following input data:
+--------+-----+
|group_id|value|
+--------+-----+
|1 |1.3 |
|2 |0.8 |
|1 |3.4 |
|1 |-1.7 |
|2 |2.3 |
|2 |5.9 |
|1 |2.7 |
|1 |0.0 |
+--------+-----+
The output would be something like
+--------+-----+-----+
|group_id|value|index|
+--------+-----+-----+
|1 |-1.7 |1 |
|1 |0.0 |2 |
|1 |1.3 |3 |
|1 |2.7 |4 |
|1 |3.4 |5 |
|2 |0.8 |1 |
|2 |2.3 |2 |
|2 |5.9 |3 |
+--------+-----+-----+
It is unimportant whether the index is 0-based or 1-based, and whether the sort is ascending or descending.
As a follow-up, consider the case where there is a third column, extra, in the original data that takes on multiple values for some (group_id, value) combinations. An example is:
+--------+-----+-----+
|group_id|value|extra|
+--------+-----+-----+
|1 |1.3 |1 |
|1 |1.3 |2 |
|2 |0.8 |1 |
|1 |3.4 |1 |
|1 |3.4 |2 |
|1 |3.4 |3 |
|1 |-1.7 |1 |
|2 |2.3 |1 |
|2 |5.9 |1 |
|1 |2.7 |1 |
|1 |0.0 |1 |
+--------+-----+-----+
Is there a way to add an index column such that the extra column is not considered but still kept? The output in this case would be
+--------+-----+-----+-----+
|group_id|value|extra|index|
+--------+-----+-----+-----+
|1 |-1.7 |1 |1 |
|1 |0.0 |1 |2 |
|1 |1.3 |1 |3 |
|1 |1.3 |2 |3 |
|1 |2.7 |1 |4 |
|1 |3.4 |1 |5 |
|1 |3.4 |2 |5 |
|1 |3.4 |3 |5 |
|2 |0.8 |1 |1 |
|2 |2.3 |1 |2 |
|2 |5.9 |1 |3 |
+--------+-----+-----+-----+
I know that it is possible to do this by:
- Duplicating the data
- Dropping the extra column
- Performing a distinct operation, which would result in the data from the original example
- Computing the index column using the original solution
- Joining the result with the data from the second example
However, this would involve a lot of extra computation and overhead.
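For reference, a minimal sketch of that workaround in PySpark, assuming df is the DataFrame from the second example (the variable names are illustrative):

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

# Drop extra and deduplicate to recover the data from the original example
deduped = df.drop('extra').distinct()

# Compute the index column with the original window-based solution
window = Window.partitionBy('group_id').orderBy('value')
indexed = deduped.withColumn('index', rank().over(window))

# Join the index back onto the full data, keeping the extra column
result = df.join(indexed, on=['group_id', 'value'])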
Answer
You can use Window functions to create a rank column based on value, partitioned by group_id:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank

# df is assumed to be the example DataFrame with columns group_id and value
# Define the window: partition by group_id, order by value within each group
window = Window.partitionBy(df['group_id']).orderBy(df['value'])

# Create the index column as the rank of each row's value within its group
df.select('*', rank().over(window).alias('index')).show()
+--------+-----+-----+
|group_id|value|index|
+--------+-----+-----+
| 1| -1.7| 1|
| 1| 0.0| 2|
| 1| 1.3| 3|
| 1| 2.7| 4|
| 1| 3.4| 5|
| 2| 0.8| 1|
| 2| 2.3| 2|
| 2| 5.9| 3|
+--------+-----+-----+
Because you first select '*', you keep all other columns with the above code as well. However, your second example shows that you are looking for the function dense_rank(), which gives a rank column with no gaps:
df.select('*', dense_rank().over(window).alias('index'))
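Applied to the data from the second example, rows that tie on value (such as the two rows with value 1.3 in group 1) share an index, and the next distinct value continues with no gap, matching the expected output above. A minimal sketch, assuming df2 holds that second DataFrame (the name is illustrative):

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# Rebuild the window from column names so it binds to df2
window2 = Window.partitionBy('group_id').orderBy('value')

# extra is not part of the window, so it does not affect the index,
# but select('*') keeps it in the output
df2.select('*', dense_rank().over(window2).alias('index')).show()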