Can't create dplyr src backed by SparkSQL in dplyr.spark.hive package


Question

Recently I found out about the great dplyr.spark.hive package, which enables dplyr frontend operations with a Spark or Hive backend.

There is information in the package README on how to install it:

options(repos = c("http://r.piccolboni.info", unlist(options("repos"))))
install.packages("dplyr.spark.hive")

There are also many examples of how to work with dplyr.spark.hive once one is already connected to a hiveServer - check this.
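
For context, here is a minimal sketch of the kind of workflow those examples describe, assuming the connection already works; the table and column names below are hypothetical, not taken from the package documentation.

# Hypothetical dplyr-style usage once src_SparkSQL() succeeds;
# "log_table", "status" and "host" are placeholders.
library(dplyr.spark.hive)
library(dplyr)

my_db <- src_SparkSQL()                  # connect to a running HiveServer2 / Thrift server
logs  <- tbl(my_db, "log_table")         # lazily reference a Hive table (hypothetical name)
logs %>%
  filter(status == "error") %>%          # translated to SQL and executed on the backend
  group_by(host) %>%
  summarise(n = n())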

But I am not able to connect to the hiveServer, so I cannot benefit from the great power of this package...

I have tried the commands below, but they did not work. Does anyone have a solution or a comment on what I am doing wrong?

> library(dplyr.spark.hive, 
+         lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’ 
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’ 
> 
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> 
> my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
> 
> my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+                      port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
> 
> my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() : 
  Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580.  Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1 
> 
> my_db = src_SparkSQL(start.server = TRUE,
+                      list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() : 
  Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580.  Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1 

EDIT 2

I have set more system variables (paths), as shown below, but now I receive a warning telling me that some kind of Java logging configuration is not specified, although I think it is:

> library(dplyr.spark.hive, 
+         lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’ 
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’ 
3: package ‘SparkR’ was built under R version 3.2.1 
> 
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
> Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop")
> Sys.setenv(PATH='/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/share/hadoop/bin:/opt/hive/bin')
> 
> 
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

My log properties are not empty:

-bash-4.2$ wc /etc/hadoop/log4j.properties 
 179  432 6581 /etc/hadoop/log4j.properties

EDIT 3

My exact call to src_SparkSQL() is as follows:

> detach("package:SparkR", unload=TRUE)
Warning message:
package ‘SparkR’ was built under R version 3.2.1 
> detach("package:dplyr", unload=TRUE)
> library(dplyr.spark.hive, lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’ 
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’ 
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

And then the process never finishes. Meanwhile, the same settings work for beeline with these parameters:

beeline  -u "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl" -n mkosinski --outputformat=tsv --incremental=true -f sql_statement.sql > sql_output

But I am not able to pass the user name and dbname to src_SparkSQL(), so I tried to use the code from inside that function manually. I run into the same problem: the code below also never finishes.

host = 'tools-1.hadoop.srv'
port = 10000
driverclass = "org.apache.hive.jdbc.HiveDriver"
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
library(RJDBC)
# build the Hive JDBC driver from the jar pointed to by HADOOP_JAR
dr = JDBC(driverclass, Sys.getenv("HADOOP_JAR"))
url = paste0("jdbc:hive2://", host, ":", port)
class = "Hive"
con.class = paste0(class, "Connection") # class = "Hive"
# dbConnect_retry =
#   function(dr, url, retry){
#     if(retry > 0)
#       tryCatch(
#         dbConnect(drv = dr, url = url),
#         error =
#           function(e) {
#             Sys.sleep(0.1)
#             dbConnect_retry(dr = dr, url = url, retry - 1)})
#     else dbConnect(drv = dr, url = url)}
#################
##con = new(con.class, dbConnect_retry(dr, url, retry = 100))
#################
con = new(con.class, dbConnect(dr, url, user = "mkosinski", dbname = "loghost"))

Maybe the url should also contain /loghost - the dbname?

Answer

The problem was that I didn't specify the proper classPath needed inside the JDBC function that creates the driver. The classPath in the dplyr.spark.hive package is passed via the HADOOP_JAR environment variable.

To use JDBC as a driver for hiveServer2 (through the Thrift protocol), one needs to add at least these 3 .jar files with Java classes to create a proper driver:


  • hive-jdbc-1.0.0-standalone.jar

  • hadoop/common/lib/commons-configuration-1.6.jar

  • hadoop/common/hadoop-common-2.4.1.jar

The versions are arbitrary - they just need to match the locally installed versions of hive, hadoop and hiveServer2.

They need to be joined with .Platform$path.sep (as described here):

classPath = c("system_path1_to_hive/hive/lib/hive-jdbc-1.0.0-standalone.jar",
              "system_path1_to_hadoop/hadoop/common/lib/commons-configuration-1.6.jar",
              "system_path1_to_hadoop/hadoop/common/hadoop-common-2.4.1.jar")
Sys.setenv(HADOOP_JAR = paste0(classPath, collapse = .Platform$path.sep))

Then, when HADOOP_JAR is set, one has to be careful with the hiveServer2 url. In my case it had to be

host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")

and finally the proper connection to hiveServer2 using the RJDBC package is

Sys.setenv(HADOOP_HOME = "/usr/share/hadoop/share/hadoop/common/")
Sys.setenv(HIVE_HOME = '/opt/hive/lib/')
host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
driverclass = "org.apache.hive.jdbc.HiveDriver"
library(RJDBC)
.jinit()
# build the Hive JDBC driver from the three jars listed above
dr2 = JDBC(driverclass,
           classPath = c("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
                         #"/opt/hive/lib/commons-configuration-1.6.jar",
                         "/usr/share/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar",
                         "/usr/share/hadoop/share/hadoop/common/hadoop-common-2.4.1.jar"),
           identifier.quote = "`")

cont = dbConnect(dr2, url, username = "mkosinski")
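
Once cont is established, queries can be issued through the standard DBI interface that RJDBC implements. A minimal sketch, where the table name some_table is a placeholder and not taken from the original session:

# Hypothetical usage of the RJDBC connection created above;
# "some_table" is a placeholder table name.
library(DBI)

dbListTables(cont)                                    # tables visible in the loghost database
res <- dbGetQuery(cont, "SELECT * FROM some_table LIMIT 10")
head(res)
dbDisconnect(cont)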
