Does collect_list() maintain relative ordering of rows?


Question

Imagine that I have the following DataFrame df:

+---+-----------+------------+
| id|featureName|featureValue|
+---+-----------+------------+
|id1|          a|           3|
|id1|          b|           4|
|id2|          a|           2|
|id2|          c|           5|
|id3|          d|           9|
+---+-----------+------------+

Imagine that I run:

import org.apache.spark.sql.functions.collect_list

df.groupBy("id")
  .agg(collect_list($"featureName").as("idx"),
       collect_list($"featureValue").as("val"))

Am I GUARANTEED that "idx" and "val" will be aggregated and keep their relative order? For example:

GOOD                   GOOD                   BAD
+---+------+------+    +---+------+------+    +---+------+------+
| id|   idx|   val|    | id|   idx|   val|    | id|   idx|   val|
+---+------+------+    +---+------+------+    +---+------+------+
|id3|   [d]|   [9]|    |id3|   [d]|   [9]|    |id3|   [d]|   [9]|
|id1|[a, b]|[3, 4]|    |id1|[b, a]|[4, 3]|    |id1|[a, b]|[4, 3]|
|id2|[a, c]|[2, 5]|    |id2|[c, a]|[5, 2]|    |id2|[a, c]|[5, 2]|
+---+------+------+    +---+------+------+    +---+------+------+

NOTE: The last result is BAD because, for id1, [a, b] should have been associated with [3, 4] (and not [4, 3]). The same applies to id2.

Recommended answer

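I think you can rely on their relative order, because Spark goes over the rows one by one, in order, and usually does not reorder rows unless it explicitly needs to. Both collect_list calls consume the same rows of each group, so "idx" and "val" should stay aligned with each other. The order of the elements inside each list is a separate matter: the Spark API documentation describes collect_list as non-deterministic, because the order of the collected results depends on the order of the rows, which may be non-deterministic after a shuffle.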

If you are concerned about the order, combine the two columns with the struct function before the groupBy, so that each name and its value are collected as a single element.

struct(colName: String, colNames: String*): Column
Creates a new struct column that composes multiple input columns.
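The struct suggestion could be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: the local SparkSession setup, the sample data (copied from the question) and the names `paired`, `result` and `pairs` are assumptions made for the example. It is written as a standalone Scala script; in spark-shell the session setup lines can be skipped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

// Local session for illustration only (assumption, not part of the original answer).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-struct")
  .getOrCreate()
import spark.implicits._

// Same data as the DataFrame in the question.
val df = Seq(
  ("id1", "a", 3), ("id1", "b", 4),
  ("id2", "a", 2), ("id2", "c", 5),
  ("id3", "d", 9)
).toDF("id", "featureName", "featureValue")

// Pack both columns into one struct per row, so each name travels with its value,
// then collect the structs instead of collecting the two columns separately.
val paired = df
  .groupBy("id")
  .agg(collect_list(struct("featureName", "featureValue")).as("pairs"))

// Selecting a field of an array of structs yields an array of that field,
// so "idx" and "val" are aligned by construction.
val result = paired.select(
  $"id",
  $"pairs.featureName".as("idx"),
  $"pairs.featureValue".as("val")
)

result.show()
```

With this shape, the pairing between a name and its value no longer depends on how two separate aggregates happen to be evaluated. The order of the pairs within each list is still whatever order Spark encountered the rows in.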

You could also use the monotonically_increasing_id function to number the records, and pair that number with the other columns (perhaps using struct):

monotonically_increasing_id(): Column
A column expression that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
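One way to combine monotonically_increasing_id with struct is sketched below. It is an illustration under stated assumptions: the column name `rowId`, the variable names `numbered` and `ordered`, the local session setup, and the use of sort_array to order the collected structs are choices made for this example and are not part of the original answer. As a standalone Scala script it needs the setup lines; in spark-shell they can be skipped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, monotonically_increasing_id, sort_array, struct}

// Local session for illustration only (assumption).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-row-id")
  .getOrCreate()
import spark.implicits._

// Same data as the DataFrame in the question.
val df = Seq(
  ("id1", "a", 3), ("id1", "b", 4),
  ("id2", "a", 2), ("id2", "c", 5),
  ("id3", "d", 9)
).toDF("id", "featureName", "featureValue")

// Tag every row with an increasing ID before any grouping happens.
// "rowId" is a hypothetical column name chosen for this sketch.
val numbered = df.withColumn("rowId", monotonically_increasing_id())

// rowId goes first in the struct: sort_array compares structs field by field,
// so sorting the collected structs orders each group by rowId.
val ordered = numbered
  .groupBy("id")
  .agg(sort_array(collect_list(struct("rowId", "featureName", "featureValue"))).as("pairs"))
  .select(
    $"id",
    $"pairs.featureName".as("idx"),
    $"pairs.featureValue".as("val")
  )

ordered.show()
```

The IDs only record the row order as it was when they were assigned, so this helps when the DataFrame still has a meaningful order at that point. If the data was already shuffled upstream, sorting by an explicit ordering column inside the struct would be the safer choice.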
