How to connect to remote hive server from spark


Problem description


I'm running Spark locally and want to access Hive tables located in a remote Hadoop cluster.

I'm able to access the Hive tables by launching beeline under SPARK_HOME:

[ml@master spark-2.0.0]$./bin/beeline 
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
Enter username for jdbc:hive2://remote_hive:10000: root
Enter password for jdbc:hive2://remote_hive:10000: ******
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ml/spark/spark-2.0.0/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/12 19:06:39 INFO jdbc.Utils: Supplied authorities: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.Utils: Resolved authority: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://remote_hive:10000
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://remote_hive:10000>

How can I access the remote Hive tables programmatically from Spark?

Solution

JDBC is not required

Spark connects directly to the Hive metastore, not through HiveServer2. To configure this:

  1. Put hive-site.xml on your classpath, and specify hive.metastore.uris to point to where your Hive metastore is hosted. Also see How to connect to a Hive metastore programmatically in SparkSQL?

  2. Import org.apache.spark.sql.hive.HiveContext, as it can perform SQL queries over Hive tables.

  3. Define val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

  4. Run sqlContext.sql("show tables") to see if it works (see the sketch after this list)
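Putting those steps together, here is a minimal sketch in the spirit of the answer (Spark 2.0-era API); the metastore URI thrift://remote_hive:9083, the app name and the database/table names are assumptions, not values from the question:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Assumes hive-site.xml (with hive.metastore.uris pointing at the remote
// metastore, e.g. thrift://remote_hive:9083, a placeholder value) is on the
// classpath, for example under $SPARK_HOME/conf.
val conf = new SparkConf().setAppName("remote-hive-metastore").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)

// If the remote tables show up here, the metastore connection works.
sqlContext.sql("show tables").show()

// Query a remote Hive table directly; database and table names are placeholders.
sqlContext.sql("SELECT * FROM some_db.some_table LIMIT 10").show()

In Spark 2.x the same can also be done with SparkSession.builder().enableHiveSupport(), but HiveContext still works and matches the steps above.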

SparkSQL on Hive tables

Conclusion: if you must go with the JDBC way

Have a look at connecting apache spark with apache hive remotely.

Please note that beeline also connects through JDBC; it is evident from your log itself.

[ml@master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000

Connecting to jdbc:hive2://remote_hive:10000

So please have a look at this interesting article:

  • Method 1: Pull table into Spark using JDBC
  • Method 2: Use Spark JdbcRDD with HiveServer2 JDBC driver
  • Method 3: Fetch dataset on a client side, then create RDD manually

Currently the HiveServer2 driver doesn't allow us to use the "Sparkling" Methods 1 and 2, so we can rely only on Method 3.
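For reference, a sketch of what Method 1 would look like using Spark's generic JDBC data source; the options reuse the URL, user and table name from this question, and per the point above this route currently fails with the HiveServer2 JDBC driver:

// Sketch only: "Method 1" via Spark's generic JDBC data source.
// With the HiveServer2 driver this does not work at the moment, hence Method 3 below.
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:hive2://remote_hive:10000")
  .option("dbtable", "stats_201512301914")
  .option("user", "root")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .load()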

Below is an example code snippet for Method 3, through which it can be achieved:

Loading data from one Hadoop cluster (aka "remote") into another one (where my Spark lives, aka "domestic") through a HiveServer2 JDBC connection.

import java.sql.{Connection, DriverManager, ResultSet, Timestamp}
import scala.collection.mutable.MutableList

case class StatsRec (
  first_name: String,
  last_name: String,
  action_dtm: Timestamp,
  size: Long,
  size_p: Long,
  size_d: Long
)

// url, user and password refer to the HiveServer2 JDBC connection
// (e.g. jdbc:hive2://remote_hive:10000) and its credentials; the Hive JDBC
// driver must be on the classpath.
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement
                   .executeQuery("SELECT * FROM stats_201512301914")

// Fetch the whole result set on the client side into a local collection (Method 3).
val fetchedRes = MutableList[StatsRec]()
while (res.next()) {
  val rec = StatsRec(res.getString("first_name"),
     res.getString("last_name"),
     Timestamp.valueOf(res.getString("action_dtm")),
     res.getLong("size"),
     res.getLong("size_p"),
     res.getLong("size_d"))
  fetchedRes += rec
}
conn.close()

// Turn the locally fetched rows into an RDD on the "domestic" cluster.
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()

// Basically we are done. To check loaded data:

println(rddStatsDelta.count)
rddStatsDelta.collect.take(10).foreach(println)
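As an optional follow-up (a sketch, not part of the original answer), the fetched rows can also be exposed to Spark SQL on the local side; the view name stats_delta is an arbitrary placeholder:

// Sketch: make the fetched rows queryable with Spark SQL (names are placeholders).
import sqlContext.implicits._

val dfStatsDelta = rddStatsDelta.toDF()
dfStatsDelta.createOrReplaceTempView("stats_delta")   // registerTempTable on older Spark
sqlContext.sql("SELECT first_name, size FROM stats_delta LIMIT 5").show()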
