Using windowing functions in Spark


Question

I am trying to use rowNumber in Spark DataFrames. My queries work as expected in the Spark shell, but when I write them out in Eclipse and compile a jar, I am facing an error:

 16/03/23 05:52:43 ERROR ApplicationMaster: User class threw exception:org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;

My query

import org.apache.spark.sql.functions.{rowNumber, max, broadcast}
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"id").orderBy($"value".desc)

val dfTop = df.withColumn("rn", rowNumber.over(w)).where($"rn" <= 3).drop("rn")

I am not using a HiveContext while running the queries in the Spark shell, so I am not sure why the same code returns an error when I run it as a jar file. I am running the scripts on Spark 1.6.0, if that helps. Did anyone face a similar issue?

Answer

I have already answered a similar question before. The error message says it all: with Spark < 2.x, you need a HiveContext in your application jar; there is no way around it.
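Concretely, that means constructing a HiveContext explicitly in the application's main method before running the window query. A minimal sketch for Spark 1.6 (the object name, app name, and input path argument are illustrative, not from the original question):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
import org.apache.spark.sql.hive.HiveContext

object TopNPerId {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopNPerId")
    val sc = new SparkContext(conf)
    // A plain SQLContext would reproduce the AnalysisException here:
    // window functions in Spark < 2.x require a HiveContext.
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val df = sqlContext.read.parquet(args(0)) // hypothetical input source
    val w = Window.partitionBy($"id").orderBy($"value".desc)
    val dfTop = df.withColumn("rn", rowNumber.over(w)).where($"rn" <= 3).drop("rn")
    dfTop.show()
  }
}
```

Note that HiveContext lives in a separate module, so the build also needs the spark-hive artifact (e.g. spark-hive_2.10 for Spark 1.6) alongside spark-sql.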

You can read more about the difference between SQLContext and HiveContext here.

SparkSQL has a SQLContext and a HiveContext. HiveContext is a superset of SQLContext, and the Spark community suggests using HiveContext. When you run spark-shell (your interactive driver application), it automatically creates a SparkContext defined as sc and a HiveContext defined as sqlContext. The HiveContext allows you to execute SQL queries as well as Hive commands.

You can check that inside your spark-shell:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)

scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]
res0: Boolean = true

scala> sqlContext.isInstanceOf[org.apache.spark.sql.SQLContext]
res1: Boolean = true

scala> sqlContext.getClass.getName
res2: String = org.apache.spark.sql.hive.HiveContext

By inheritance, a HiveContext is actually a SQLContext, but the reverse is not true. You can check the source code if you are interested in how HiveContext inherits from SQLContext.

Since Spark 2.0, you just need to create a SparkSession (as the single entry point), which removes the HiveContext/SQLContext confusion.
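For reference, the same top-3-per-id query under Spark 2.x can be written against a SparkSession; a sketch, assuming df is the same DataFrame as in the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder()
  .appName("TopNPerId")
  .getOrCreate() // enableHiveSupport() is only needed for actual Hive access,
                 // not for window functions in 2.x
import spark.implicits._

val w = Window.partitionBy($"id").orderBy($"value".desc)
// rowNumber was renamed to row_number in later releases
val dfTop = df.withColumn("rn", row_number().over(w)).where($"rn" <= 3).drop("rn")
```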

