Scala Spark collect_list() vs array()

Question

What is the difference between collect_list() and array() in Spark using Scala?

I see both used all over the place, but the use cases are not clear enough for me to tell them apart.

Recommended answer

Even though both array and collect_list return an ArrayType column, the two methods are very different.

The array method combines several columns "column-wise" into one array per row, whereas collect_list aggregates the values of a single column "row-wise", typically per group (or per Window partition), into one array, as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "a", "b"),
  (1, "c", "d"),
  (2, "e", "f")
).toDF("c1", "c2", "c3")

// array: one array per input row, built from the values of columns c2 and c3
df.
  withColumn("arr", array("c2", "c3")).
  show()
// +---+---+---+------+
// | c1| c2| c3|   arr|
// +---+---+---+------+
// |  1|  a|  b|[a, b]|
// |  1|  c|  d|[c, d]|
// |  2|  e|  f|[e, f]|
// +---+---+---+------+

// collect_list: one array per group, built from the c2 values of all rows in that group
df.
  groupBy("c1").agg(collect_list("c2")).
  show()
// +---+----------------+
// | c1|collect_list(c2)|
// +---+----------------+
// |  1|          [a, c]|
// |  2|             [e]|
// +---+----------------+
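
The answer also mentions Window partitions. A minimal sketch of that variant follows, assuming a local SparkSession (in spark-shell, `spark` already exists) and using the illustrative names `byGroup` and `c2_list`, which are not from the original answer. Unlike groupBy, a window keeps every input row and attaches the collected array to each of them. The order of elements inside the array is not guaranteed unless the data is explicitly ordered.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumption: a local session for illustration; getOrCreate() reuses an existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-window")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "a", "b"),
  (1, "c", "d"),
  (2, "e", "f")
).toDF("c1", "c2", "c3")

// Partition by c1 with no ordering: the window frame is the whole partition,
// so each row sees every c2 value that shares its c1.
val byGroup = Window.partitionBy("c1")

// All three input rows are kept; each gains the list collected for its partition.
val result = df.withColumn("c2_list", collect_list("c2").over(byGroup))
result.show()
```

Use groupBy when one row per group is wanted, and a window when the original rows need to be kept alongside the aggregated array.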
