Does collect_list() maintain relative ordering of rows?
Question
Imagine that I have the following DataFrame df:
+---+-----------+------------+
| id|featureName|featureValue|
+---+-----------+------------+
|id1|          a|           3|
|id1|          b|           4|
|id2|          a|           2|
|id2|          c|           5|
|id3|          d|           9|
+---+-----------+------------+
Imagine that I am running:
df.groupBy("id")
  .agg(collect_list($"featureName").as("idx"),
       collect_list($"featureValue").as("val"))
Am I GUARANTEED that "idx" and "val" will be aggregated and keep their relative order? That is:
       GOOD                GOOD                 BAD
+---+------+------+ +---+------+------+ +---+------+------+
| id| idx| val| | id| idx| val| | id| idx| val|
+---+------+------+ +---+------+------+ +---+------+------+
|id3| [d]| [9]| |id3| [d]| [9]| |id3| [d]| [9]|
|id1|[a, b]|[3, 4]| |id1|[b, a]|[4, 3]| |id1|[a, b]|[4, 3]|
|id2|[a, c]|[2, 5]| |id2|[c, a]|[5, 2]| |id2|[a, c]|[5, 2]|
+---+------+------+ +---+------+------+ +---+------+------+
NOTE: The last case is BAD because, for id1, [a, b] should have been associated with [3, 4] (and not [4, 3]). The same goes for id2.
Recommended answer
I think you can rely on "their relative order", as Spark goes over rows one by one in order (and usually does not reorder rows unless explicitly needed).
If you are concerned with the order, merge these two columns using the struct function before doing the groupBy.
struct(colName: String, colNames: String*): Column
Creates a new struct column that composes multiple input columns.
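The struct suggestion can be sketched as follows. This is a minimal sketch, not a definitive implementation: the object name PairedCollect, the helper pairedPerId, and the intermediate column name "pairs" are made up for illustration, and the column names are taken from the question's DataFrame. Because each name and value travel together inside one struct, the pairing no longer depends on two separate collect_list calls seeing rows in the same order. The order of the pairs within each list is still not fixed; the Spark API documentation describes collect_list as non-deterministic in that respect after a shuffle.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Hypothetical helper object, named here only for illustration.
object PairedCollect {

  // Collects (featureName, featureValue) as ONE array of structs per id,
  // so a name cannot be separated from the value of its own row.
  def pairedPerId(df: DataFrame): DataFrame =
    df.groupBy("id")
      .agg(collect_list(struct(col("featureName"), col("featureValue"))).as("pairs"))
      // Selecting a field from an array of structs yields an array of that
      // field, element by element, so idx(i) and val(i) share a source row.
      .select(
        col("id"),
        col("pairs.featureName").as("idx"),
        col("pairs.featureValue").as("val"))

  def main(args: Array[String]): Unit = {
    // Local session, assumed here just to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("paired-collect")
      .getOrCreate()
    import spark.implicits._

    // The DataFrame from the question.
    val df = Seq(
      ("id1", "a", 3), ("id1", "b", 4),
      ("id2", "a", 2), ("id2", "c", 5),
      ("id3", "d", 9)
    ).toDF("id", "featureName", "featureValue")

    pairedPerId(df).show()
    spark.stop()
  }
}
```

With this shape, either of the two GOOD outcomes from the question can appear, but the BAD one cannot, since a name and its value are never collected independently.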
You could also use the monotonically_increasing_id function to number the records and pair that number with the other columns (perhaps using struct):
monotonically_increasing_id(): Column
A column expression that generates monotonically increasing 64-bit integers.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
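One way to read the monotonically_increasing_id suggestion is sketched below, under stated assumptions: the object name OrderedCollect, the helper orderedPerId, and the column names "rowId" and "pairs" are hypothetical, and sorting the collected structs with sort_array is an addition of this sketch rather than something the answer spells out. The generated IDs reflect the partition and row order of the DataFrame at the moment they are assigned, so this only pins down an order that the input already has.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, monotonically_increasing_id, sort_array, struct}

// Hypothetical helper object, named here only for illustration.
object OrderedCollect {

  // Numbers each row, collects (rowId, featureName, featureValue) structs per id,
  // and sorts each collected array. Structs compare field by field, so putting
  // rowId first makes the sort follow the row numbering.
  def orderedPerId(df: DataFrame): DataFrame =
    df.withColumn("rowId", monotonically_increasing_id())
      .groupBy("id")
      .agg(
        sort_array(
          collect_list(struct(col("rowId"), col("featureName"), col("featureValue")))
        ).as("pairs"))
      // Both output arrays are read from the same sorted array of structs,
      // so they stay aligned with each other.
      .select(
        col("id"),
        col("pairs.featureName").as("idx"),
        col("pairs.featureValue").as("val"))

  def main(args: Array[String]): Unit = {
    // Local session, assumed here just to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ordered-collect")
      .getOrCreate()
    import spark.implicits._

    // The DataFrame from the question.
    val df = Seq(
      ("id1", "a", 3), ("id1", "b", 4),
      ("id2", "a", 2), ("id2", "c", 5),
      ("id3", "d", 9)
    ).toDF("id", "featureName", "featureValue")

    orderedPerId(df).show()
    spark.stop()
  }
}
```

Compared with collecting the structs alone, this variant also fixes the order inside each list, at the cost of an extra column and a per-group sort.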