Use more than one collect_list in one query in Spark SQL

Question
I have the following dataframe data:
root
|-- userId: string
|-- product: string
|-- rating: double
and the following query:
val result = sqlContext.sql("select userId, collect_list(product), collect_list(rating) from data group by userId")
My question is: do product and rating in the aggregated arrays match each other? That is, do the product and the rating from the same row have the same index in the aggregated arrays?
Update:
Starting from Spark 2.0.0, one can call collect_list on a struct type, so we can do a single collect_list on a combined column. For versions before 2.0.0, collect_list can only be used on primitive types.
I believe there is no explicit guarantee that all arrays will have the same order. Spark SQL uses multiple optimizations, and under certain conditions there is no guarantee that all aggregations are scheduled at the same time (one example is aggregation with DISTINCT). Since an exchange (shuffle) results in nondeterministic order, it is theoretically possible that the orders will differ.
So while it should work in practice, it could be risky and introduce some hard-to-detect bugs.
If you use Spark 2.0.0 or later, you can aggregate non-atomic columns with collect_list:
SELECT userId, collect_list(struct(product, rating)) FROM data GROUP BY userId
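The same struct-based aggregation can also be written with the DataFrame API. A minimal sketch (the sample rows and the local SparkSession setup are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-structs")
  .getOrCreate()
import spark.implicits._

// Illustrative data shaped like the schema in the question.
val data = Seq(
  ("u1", "p1", 4.0),
  ("u1", "p2", 3.5),
  ("u2", "p1", 5.0)
).toDF("userId", "product", "rating")

// Collecting (product, rating) as one struct keeps each pair together,
// so matching indices are no longer an issue.
val result = data
  .groupBy("userId")
  .agg(collect_list(struct($"product", $"rating")).as("pairs"))
```

Each element of pairs is a struct whose product and rating came from the same input row, by construction.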
If you use an earlier version you can try to use explicit partitions and order:
WITH tmp AS (
SELECT * FROM data DISTRIBUTE BY userId SORT BY userId, product, rating
)
SELECT userId, collect_list(product), collect_list(rating)
FROM tmp
GROUP BY userId
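The pre-2.0 workaround above can be sketched in the DataFrame API as well, assuming repartition to mimic DISTRIBUTE BY and sortWithinPartitions to mimic SORT BY (illustrative only; as noted above, ordering after a shuffle carries no explicit guarantee):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-sorted")
  .getOrCreate()
import spark.implicits._

// Illustrative data shaped like the schema in the question.
val data = Seq(
  ("u1", "p1", 4.0),
  ("u1", "p2", 3.5),
  ("u2", "p1", 5.0)
).toDF("userId", "product", "rating")

// Repartition by userId (≈ DISTRIBUTE BY) and sort inside each
// partition (≈ SORT BY), so both collect_list calls see the rows
// in the same order within each group.
val tmp = data
  .repartition($"userId")
  .sortWithinPartitions($"userId", $"product", $"rating")

val legacy = tmp
  .groupBy("userId")
  .agg(collect_list($"product").as("products"),
       collect_list($"rating").as("ratings"))
```

Both repartition(Column*) and sortWithinPartitions are available from Spark 1.6.0, so this applies to the pre-2.0.0 case discussed above.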