Querying on multiple Hive stores using Apache Spark
Question
I have a Spark application which successfully connects to Hive and queries Hive tables using the Spark engine.

To build this, I just added hive-site.xml to the application's classpath, and Spark reads the hive-site.xml to connect to its metastore. This method was suggested on Spark's mailing list.

So far so good. Now I want to connect to two Hive stores, and I don't think adding another hive-site.xml to my classpath will help. I referred to quite a few articles and Spark mailing lists but could not find anyone doing this.

Can someone suggest how I can achieve this?
Thanks.
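For context, here is a minimal sketch of the single-store setup described in the question, assuming Spark 1.3.x with a hive-site.xml on the classpath; the table placeholder is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveSingleStore {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("HiveSingleStore").setMaster("local"))
    // HiveContext picks up the metastore location from the hive-site.xml on the classpath
    val hiveContext = new HiveContext(sc)
    // <db.tablename> is a placeholder for an actual Hive table
    hiveContext.sql("SELECT * FROM <db.tablename> LIMIT 10").collect().foreach(println)
  }
}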
Docs referred:
Solution

I think this is possible by making use of Spark SQL's capability of connecting to and reading data from remote databases using JDBC.
After exhaustive R&D, I was able to successfully connect to two different Hive environments using JDBC and load the Hive tables as DataFrames into Spark for further processing.
Environment details
hadoop-2.6.0
apache-hive-2.0.0-bin
spark-1.3.1-bin-hadoop2.6
Code sample: HiveMultiEnvironment.scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object HiveMultiEnvironment {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("JDBC").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load a Hive table (or sub-query) from Environment 1
    val jdbcDF1 = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:hive2://<host1>:10000/<db>",
      "dbtable" -> "<db.tablename or subquery>",
      "driver" -> "org.apache.hive.jdbc.HiveDriver",
      "user" -> "<username>",
      "password" -> "<password>"))
    jdbcDF1.foreach { println }

    // Load a Hive table (or sub-query) from Environment 2
    val jdbcDF2 = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:hive2://<host2>:10000/<db>",
      "dbtable" -> "<db.tablename or subquery>",
      "driver" -> "org.apache.hive.jdbc.HiveDriver",
      "user" -> "<username>",
      "password" -> "<password>"))
    jdbcDF2.foreach { println }

    // todo: business logic
  }
}
Other parameters can also be set during the load using SqlContext, such as partitionColumn (a sketch follows below). Details can be found under the 'JDBC To Other Databases' section of the Spark reference doc: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
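Here is a hedged sketch of a partitioned JDBC load using the same Spark 1.3.x load API as above; the partition column "id" and the bounds are hypothetical and must correspond to a numeric column in the actual table:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PartitionedJdbcLoad {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionedJDBC").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    // partitionColumn/lowerBound/upperBound/numPartitions split the read
    // into parallel JDBC queries, one per partition range
    val partitionedDF = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:hive2://<host1>:10000/<db>",
      "dbtable" -> "<db.tablename>",
      "driver" -> "org.apache.hive.jdbc.HiveDriver",
      "partitionColumn" -> "id",
      "lowerBound" -> "1",
      "upperBound" -> "100000",
      "numPartitions" -> "10",
      "user" -> "<username>",
      "password" -> "<password>"))
    println(partitionedDF.count())
  }
}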
Build path from Eclipse:
What I Haven't Tried
Use of HiveContext for Environment 1 and SqlContext for Environment 2 (a sketch of what this might look like follows below)
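Since this is explicitly untried, the following is only a sketch of what the mixed-context approach might look like, assuming Environment 1's hive-site.xml is on the classpath and Environment 2 is reached over JDBC; all host, table, and credential values are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object MixedContexts {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("MixedContexts").setMaster("local"))

    // Environment 1: HiveContext reads the metastore configured by the
    // hive-site.xml on the classpath
    val hiveContext = new HiveContext(sc)
    val df1 = hiveContext.sql("SELECT * FROM <db.tablename>")

    // Environment 2: plain SQLContext pulling the second store over JDBC
    val sqlContext = new SQLContext(sc)
    val df2 = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:hive2://<host2>:10000/<db>",
      "dbtable" -> "<db.tablename>",
      "driver" -> "org.apache.hive.jdbc.HiveDriver",
      "user" -> "<username>",
      "password" -> "<password>"))

    df1.foreach { println }
    df2.foreach { println }
  }
}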
Hope this will be useful.