How to indicate the database in SparkSQL over Hive in Spark 1.3
Problem Description
I have a simple piece of Scala code that retrieves data from a Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
// Assumes an existing SparkContext named sc
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark that I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, and that is slow. Unfortunately, I can't force it to use "spark". I tried SQLContext instead, replacing the HiveContext with hc = new SQLContext(sc), to see if performance would improve. With this change, the line
hc.sql("use myDatabase")
is throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that SparkSQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
use database

is supported in later Spark versions; see https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html

You need to issue the use statement and the query as two separate spark.sql calls, like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
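
For Spark 1.3 itself, here is a minimal sketch of the same idea with HiveContext, assuming an existing SparkContext sc and a Hive metastore that contains the database and table from the question (myDatabase.account). The second option uses Hive's database-qualified table syntax (db.table), which avoids the separate use call entirely:

```scala
// Sketch for Spark 1.3; assumes a live SparkContext `sc` and a Hive
// metastore with a table `account` in database `myDatabase`.
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)

// Option 1: switch the current database in its own sql() call,
// then run the query in a second call.
hc.sql("use myDatabase")
val rdd1 = hc.sql("select PRODUCT_CODE, DATA_UNIT from account").rdd

// Option 2: qualify the table with the database name directly,
// so no separate `use` statement is needed.
val rdd2 = hc.sql("select PRODUCT_CODE, DATA_UNIT from myDatabase.account").rdd
```

Note that neither option works with a plain SQLContext in 1.3, since its simple SQL parser does not understand HiveQL statements such as use; the HiveContext is required for Hive table access.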