Use collect_list and collect_set in Spark SQL
Question
According to the docs, the collect_set and collect_list functions should be available in Spark SQL. However, I cannot get them to work. I'm running Spark 1.6.0 using a Docker image.
I'm trying to do this in Scala:
import org.apache.spark.sql.functions._
df.groupBy("column1")
.agg(collect_set("column2"))
.show()
And receive the following error at runtime:
Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function collect_set;
I also tried it using pyspark, but it fails there too. The docs state these functions are aliases of Hive UDAFs, but I can't figure out how to enable them.
How can I fix this? Thanks!
Answer
Spark 2.0+:
SPARK-10605 introduced native collect_list and collect_set implementations. Neither a SparkSession with Hive support nor a HiveContext is required anymore.
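For context, the two aggregates differ only in deduplication: collect_list keeps every value in each group, while collect_set keeps only the distinct ones. A plain-Python sketch of those semantics (no Spark required; the sample rows and column pairing are made up for illustration):

```python
from collections import defaultdict

# Hypothetical rows of (column1, column2) pairs, mirroring df.groupBy("column1").
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)]

def collect_list(rows):
    # Gather every column2 value per column1 group, duplicates included.
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)

def collect_set(rows):
    # Same grouping, but keep only the distinct values per group.
    return {key: set(values) for key, values in collect_list(rows).items()}

print(collect_list(rows))  # {'a': [1, 1, 2], 'b': [3]}
print(collect_set(rows))   # {'a': {1, 2}, 'b': {3}}
```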
Spark 2.0-SNAPSHOT (before 2016-05-03):
You have to enable Hive support for a given SparkSession:
In Scala:
val spark = SparkSession.builder
.master("local")
.appName("testing")
.enableHiveSupport() // <- enable Hive support.
.getOrCreate()
In Python:
spark = (SparkSession.builder
.enableHiveSupport()
.getOrCreate())
Spark < 2.0:
To be able to use Hive UDFs (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF), you have to use a Spark build with Hive support (this is already the case with the pre-built binaries, which seems to be the situation here) and initialize your SQLContext as a HiveContext.
In Scala:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
val sqlContext: SQLContext = new HiveContext(sc)
In Python:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)