Use more than one collect_list in one query in Spark SQL


Question

I have the following DataFrame, data:

root
 |-- userId: string 
 |-- product: string 
 |-- rating: double

And the following query:

val result = sqlContext.sql("select userId, collect_list(product), collect_list(rating) from data group by userId")

My question is: do the product and rating values in the aggregated arrays match each other? That is, do the product and the rating from the same row have the same index in the two aggregated arrays?

Update: Starting with Spark 2.0.0, collect_list works on struct types, so a single collect_list over a combined column is enough. In versions before 2.0.0, collect_list can only be used on primitive types.

Recommended answer

I believe there is no explicit guarantee that all arrays will have the same order. Spark SQL applies multiple optimizations, and under certain conditions there is no guarantee that all aggregations are scheduled at the same time (one example is aggregation with DISTINCT). Since an exchange (shuffle) results in nondeterministic order, it is theoretically possible that the order will differ between the arrays.

So while it should work in practice, it is risky and could introduce some hard-to-detect bugs.

If you use Spark 2.0.0 or later, you can aggregate non-atomic columns with collect_list:

SELECT userId, collect_list(struct(product, rating)) FROM data GROUP BY userId
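
If two separate arrays are still needed afterwards, they can be derived from the single list of structs, so both inherit the same element order. The following is a minimal sketch with the DataFrame API, assuming Spark 2.0.0 or later running in local mode. The sample rows, the object name PairedCollectList, the helper pairedArrays, and the column names pairs, products, and ratings are made up for illustration and are not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object PairedCollectList {

  // Hypothetical helper: expects columns userId, product, rating.
  def pairedArrays(data: DataFrame): DataFrame = {
    // Collect each (product, rating) pair as one struct so the two
    // values travel together through the aggregation.
    val paired = data
      .groupBy("userId")
      .agg(collect_list(struct(col("product"), col("rating"))).as("pairs"))

    // Selecting a field from an array of structs yields an array of that
    // field, so both arrays come from the same list in the same order.
    paired.select(
      col("userId"),
      col("pairs.product").as("products"),
      col("pairs.rating").as("ratings")
    )
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("paired-collect-list")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows with the same schema as `data` in the question.
    val data = Seq(
      ("u1", "apple", 4.0),
      ("u1", "pear", 2.5),
      ("u2", "plum", 5.0)
    ).toDF("userId", "product", "rating")

    pairedArrays(data).show(false)
    spark.stop()
  }
}
```

The order of the pairs inside each list is still not deterministic; the point is only that products(i) and ratings(i) always come from the same input row.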

If you use an earlier version, you can try explicit partitioning and ordering:

WITH tmp AS (
  SELECT * FROM data DISTRIBUTE BY userId SORT BY userId, product, rating
)
SELECT userId, collect_list(product), collect_list(rating)
FROM tmp
GROUP BY userId
