How to indicate the database in SparkSQL over Hive in Spark 1.3


Problem description


I have a simple piece of Scala code that retrieves data from a Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:

val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd

The version of Spark I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, and that is slow. Unfortunately, I can't force it to use "spark". I tried SQLContext instead, replacing the first line with hc = new SQLContext(sc), to see whether performance would improve. With this change, the line

hc.sql("use myDatabase")

is throwing the following exception:

Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found

use myDatabase
^

The Spark 1.3 documentation says that SparkSQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
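With a HiveContext, you can usually avoid a separate use call altogether by qualifying the table name with its database directly in the SQL. A minimal sketch of that idea, reusing the myDatabase.account table from the question (qualifiedName is a hypothetical helper, not a Spark API):

```scala
// Sketch: pick the database by qualifying the table name in the query
// itself instead of issuing a separate "use" statement.
// `qualifiedName` is a hypothetical helper, not a Spark API.
def qualifiedName(db: String, table: String): String = s"$db.$table"

val mySql = s"select PRODUCT_CODE, DATA_UNIT from ${qualifiedName("myDatabase", "account")}"

// With a HiveContext (as in the question), the query then runs against
// myDatabase without a prior hc.sql("use myDatabase"):
// val rdd = hc.sql(mySql).rdd
```

This keeps the query self-contained, so it does not depend on session state set by an earlier statement.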

Solution

The use database statement is supported in later Spark versions:

https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html

You need to put the statements in two separate spark.sql calls, like this:

spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
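The key point is that both statements run in order on the same session, so the use takes effect before the select. A minimal sketch of that ordering, reusing the mydb and mytab_in_mydb names from the answer (spark is assumed to be a Hive-enabled SparkSession):

```scala
// Sketch: the two statements from the answer, kept in order so that
// "use mydb" takes effect before the select runs on the same session.
val statements = Seq(
  "use mydb",
  "select * from mytab_in_mydb"
)

// statements.foreach(s => spark.sql(s))
// `spark` is assumed to be a Hive-enabled SparkSession backed by a
// running metastore, so this line is left commented out here.
```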
