How to connect to remote hive server from spark


Problem Description


I'm running Spark locally and want to access Hive tables, which are located in a remote Hadoop cluster.

I'm able to access the Hive tables by launching beeline under SPARK_HOME:

[ml@master spark-2.0.0]$./bin/beeline 
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
Enter username for jdbc:hive2://remote_hive:10000: root
Enter password for jdbc:hive2://remote_hive:10000: ******
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ml/spark/spark-2.0.0/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/12 19:06:39 INFO jdbc.Utils: Supplied authorities: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.Utils: Resolved authority: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://remote_hive:10000
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://remote_hive:10000>

How can I access the remote Hive tables programmatically from Spark?

Solution

JDBC is not required

Spark connects directly to the Hive metastore, not through HiveServer2. To configure this,

  1. Put hive-site.xml on your classpath, and specify hive.metastore.uris to point to where your Hive metastore is hosted. Also see How to connect to a Hive metastore programmatically in SparkSQL?

  2. Import org.apache.spark.sql.hive.HiveContext, since it can perform SQL queries over Hive tables.

  3. Define val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

  4. Verify with sqlContext.sql("show tables") to see if it works (a minimal sketch putting these steps together follows this list).
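A minimal sketch putting the four steps together, assuming the remote metastore listens on thrift://remote_hive:9083 (the host comes from the question; the port, the app name and the local[*] master are illustrative assumptions, check hive.metastore.uris in your hive-site.xml):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Local Spark driver talking to the remote Hive metastore.
val conf = new SparkConf().setAppName("remote-hive-example").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)

// Normally hive.metastore.uris is picked up from hive-site.xml on the classpath;
// it can also be set programmatically before the first Hive query
// (thrift://remote_hive:9083 is an assumed value).
sqlContext.setConf("hive.metastore.uris", "thrift://remote_hive:9083")

// Step 4: verify the connection by listing the Hive tables
sqlContext.sql("show tables").show()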

SparkSQL on Hive tables

Conclusion: If you must go with the JDBC way

Have a look at connecting apache spark with apache hive remotely.

Please note that beeline also connects through JDBC; it is evident from your log itself:

[ml@master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000

So please have a look at this interesting article, which covers:

  • Method 1: Pull table into Spark using JDBC
  • Method 2: Use Spark JdbcRDD with HiveServer2 JDBC driver
  • Method 3: Fetch dataset on a client side, then create RDD manually

Currently the HiveServer2 driver doesn't allow us to use the "Sparkling" Methods 1 and 2, so we can rely only on Method 3.

Below is an example code snippet through which it can be achieved:

Loading data from one Hadoop cluster (aka "remote") into another one (where my Spark lives, aka "domestic") through a HiveServer2 JDBC connection.

import java.sql.{Connection, DriverManager, ResultSet, Timestamp}
import scala.collection.mutable.MutableList

// Row layout of the remote table being fetched
case class StatsRec (
  first_name: String,
  last_name: String,
  action_dtm: Timestamp,
  size: Long,
  size_p: Long,
  size_d: Long
)

// url, user and password point at the remote HiveServer2,
// e.g. url = "jdbc:hive2://remote_hive:10000"
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement
                   .executeQuery("SELECT * FROM stats_201512301914")

// Fetch the whole result set on the client side ...
val fetchedRes = MutableList[StatsRec]()
while (res.next()) {
  val rec = StatsRec(res.getString("first_name"),
     res.getString("last_name"),
     Timestamp.valueOf(res.getString("action_dtm")),
     res.getLong("size"),
     res.getLong("size_p"),
     res.getLong("size_d"))
  fetchedRes += rec
}
conn.close()

// ... then create the RDD manually and cache it (Method 3)
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()

// Basically we are done. To check loaded data:
println(rddStatsDelta.count)
rddStatsDelta.collect.take(10).foreach(println)
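As an optional follow-up (a sketch, not part of the original answer), the manually created RDD can be turned into a DataFrame and registered as a temporary view so it becomes queryable with SparkSQL; it assumes the StatsRec case class, sqlContext and rddStatsDelta defined above, and the Spark 2.x view API:

import sqlContext.implicits._

// Convert the RDD of case classes into a DataFrame and expose it to SparkSQL.
val statsDeltaDF = rddStatsDelta.toDF()
statsDeltaDF.createOrReplaceTempView("stats_delta")   // use registerTempTable on Spark 1.x

sqlContext.sql("SELECT first_name, size FROM stats_delta LIMIT 10").show()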

