Using windowing functions in Spark
Problem description
I am trying to use rowNumber in Spark DataFrames. My queries work as expected in the Spark shell, but when I write them out in Eclipse and compile a jar, I am facing an error:
16/03/23 05:52:43 ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
My query:
import org.apache.spark.sql.functions.{rowNumber, max, broadcast}
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"id").orderBy($"value".desc)
val dfTop = df.withColumn("rn", rowNumber.over(w)).where($"rn" <= 3).drop("rn")
I am not using a HiveContext when running the queries in the Spark shell, so I am not sure why it returns an error when I run the same code as a jar file. I am running the scripts on Spark 1.6.0, if that helps. Has anyone faced a similar issue?
Recommended answer
I have already answered a similar question before. The error message says it all: with Spark versions earlier than 2.x, you need a HiveContext in your application jar; there is no other way around it.
You can read more about the difference between SQLContext and HiveContext here.
SparkSQL has a SQLContext and a HiveContext. HiveContext is a superset of SQLContext. The Spark community suggests using the HiveContext. You can see that when you run spark-shell, which is your interactive driver application, it automatically creates a SparkContext defined as sc and a HiveContext defined as sqlContext. The HiveContext allows you to execute SQL queries as well as Hive commands.
You can try to check that inside your spark-shell:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_74)
scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]
res0: Boolean = true
scala> sqlContext.isInstanceOf[org.apache.spark.sql.SQLContext]
res1: Boolean = true
scala> sqlContext.getClass.getName
res2: String = org.apache.spark.sql.hive.HiveContext
By inheritance, HiveContext is actually a SQLContext, but it's not true the other way around. You can check the source code if you are more interested in knowing how HiveContext inherits from SQLContext.
Since Spark 2.0, you just need to create a SparkSession (as the single entry point), which removes the HiveContext/SQLContext confusion.
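For comparison, a minimal sketch of the same top-N query on Spark 2.x, where SparkSession is the single entry point and no Hive support is needed (the object name, app name, and sample data are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

object TopNPerGroup2x {
  def main(args: Array[String]): Unit = {
    // SparkSession replaces both SQLContext and HiveContext in Spark 2.x;
    // window functions work out of the box.
    val spark = SparkSession.builder()
      .appName("TopNPerGroup2x")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 10), (1, 20), (2, 30)).toDF("id", "value")

    // row_number() replaces the deprecated rowNumber from Spark 1.x.
    val w = Window.partitionBy($"id").orderBy($"value".desc)
    df.withColumn("rn", row_number().over(w))
      .where($"rn" <= 3)
      .drop("rn")
      .show()
  }
}
```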