Connecting RStudio Server Pro to Hive on GCP


Question

This is not strictly a programming question, so please bear with me.

I have set up two instances on GCP: one runs RStudio Server Pro and the other is my cluster with a Hive database. I want to access the Hive database from RStudio Server Pro. Both run on GCP.

Can anyone guide me on this? (I have seen articles on connecting RStudio Desktop to Hive, and on running RStudio Server from within a Spark cluster, but I need to link RStudio Server Pro to a Hive database, with both running on GCP.)

Recommended answer

For future reference: RStudio Server on Dataproc.

In this particular case, I expose the data from Hive to Spark and use the sparklyr package to connect from RStudio Server running within the same cluster. If you want to connect to Hive directly, you may also look into a "Hive-R-JDBC" connection.

GCP offers RStudio Server Pro on Compute Engine, but it is not cost efficient. I used it for about 8 hours and was billed roughly $21. At 5 days a week, you are looking at more than $100. I hope the following steps help:

RStudio Server runs on port 8787, so you have to allow this port in your firewall rules. Open the hamburger menu in the GCP console, scroll down to VPC Network, click Firewall rules, and add a rule that allows TCP port 8787.
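The same rule can also be created from the command line. The sketch below only prints the gcloud command so it can be reviewed first. The rule name, network, and source range are placeholders I made up for illustration, not values from the original post.

```shell
# Hypothetical values: adjust the rule name, network, and source range.
# Limiting SOURCE_RANGE to your own address avoids exposing the login page publicly.
RULE_NAME="allow-rstudio-8787"
NETWORK="default"
SOURCE_RANGE="203.0.113.5/32"   # placeholder from the documentation address range

firewall_cmd() {
  # Print the gcloud command instead of running it, so it can be checked first.
  printf 'gcloud compute firewall-rules create %s --network=%s --direction=INGRESS --allow=tcp:8787 --source-ranges=%s\n' \
    "$RULE_NAME" "$NETWORK" "$SOURCE_RANGE"
}

firewall_cmd               # review the command
# eval "$(firewall_cmd)"   # uncomment to actually create the rule
```

Restricting the source range is a design choice on my part; the console steps above do not specify one.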

Set up a Dataproc cluster based on your requirements and location. Then either SSH into it from the browser window or connect through the gcloud command line. Just press Enter when it prompts you to run in Cloud Shell.

Once you are in the SSH window or the gcloud command line, add a user for RStudio Server:

 sudo adduser rstudio 

Set a password for it and remember it.

Next, go to the RStudio website and copy the link address of the RStudio Server download:

Go back to your SSH window or command line and install it. Paste the link address after sudo wget, like so:

sudo wget https://s3.amazonaws.com/rstudio-ide-build/server/trusty/amd64/rstudio-server-1.2.650-amd64.deb

Then run:

sudo apt-get install gdebi-core

Followed by the command below. Note that the file name must match the RStudio Server version in the link above.

sudo gdebi rstudio-server-1.2.650-amd64.deb

Press yes to accept, and you should see a message that RStudio Server is active (running). Now navigate to the Compute Engine tab in GCP and copy the external IP of your cluster's master node (the first one listed). Then open a new browser tab and enter:

http://<yourexternalIPaddress>:8787 

This should open RStudio Server. Enter "rstudio" as the user ID, along with the password you set earlier. You now have RStudio Server up and running on your Dataproc cluster.
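If you would rather not look the address up in the console, the external IP can be read with gcloud and turned into the login URL. This is a minimal sketch. The instance name my-cluster-m and the zone us-central1-a are hypothetical placeholders, and the IP below is a documentation-range address used only for illustration.

```shell
# Build the RStudio Server URL from an external IP address.
rstudio_url() {
  printf 'http://%s:8787\n' "$1"
}

# On a machine with gcloud configured, the master node's IP could be fetched
# like this (instance name and zone are hypothetical placeholders):
#   IP=$(gcloud compute instances describe my-cluster-m --zone=us-central1-a \
#        --format='get(networkInterfaces[0].accessConfigs[0].natIP)')
IP="203.0.113.5"   # placeholder address for illustration

rstudio_url "$IP"
```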

**Hive**:

Go back to the terminal and enter:

     beeline -u jdbc:hive2://localhost:10000/default -n <myusername>@<clustername-m> -d org.apache.hive.jdbc.HiveDriver

Next we make data stored in Google Cloud Storage, which Dataproc can use in place of HDFS, available in Hive. An external table does not copy the data. It points Hive at the files in the bucket. Enter the command:

 CREATE EXTERNAL TABLE <giveatablename>
    (location CHAR(1),
     dept CHAR(1),
     eid INT,
     emanager VARCHAR(6))
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 -- LOCATION must be the folder that contains the CSV file, not the file itself
 LOCATION 'gs://<yourgooglestoragebucket>/<foldername>/';

You now have a table in Hive, with the name you chose, that has the columns location, dept, eid, and emanager, backed by the CSV file in your Google Cloud Storage bucket.
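Because the table definition expects exactly four comma-separated fields per row, it can help to check the CSV before uploading it to the bucket. The following is a sketch under my own assumptions: the sample rows are invented, and bad_rows is a hypothetical helper, not part of Hive or gsutil.

```shell
# Create a small sample file with the four columns the table expects:
# location CHAR(1), dept CHAR(1), eid INT, emanager VARCHAR(6).
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
A,X,101,smith
B,Y,102,jones
EOF

# Count rows that do not have exactly four comma-separated fields.
bad_rows() {
  awk -F',' 'NF != 4 { n++ } END { print n + 0 }' "$1"
}

echo "malformed rows: $(bad_rows "$SAMPLE")"

# Upload once the check passes (bucket and folder are placeholders):
#   gsutil cp "$SAMPLE" gs://<yourgooglestoragebucket>/<foldername>/
```

Rows with the wrong number of fields would otherwise load as NULL or shifted values rather than failing loudly.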

Now exit from Beeline (type !quit or press CTRL+D; CTRL+Z only suspends the process) and type in:

    sudo ln -s /etc/hive/conf/hive-site.xml /etc/spark/conf/hive-site.xml

This creates a symbolic link from Spark's configuration directory to Hive's configuration file, so that Spark reads the same Hive settings. Linking is better than copying the file, because a copy can drift out of sync with the original and cause confusion.
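The difference between linking and copying can be seen in a throwaway directory, without touching anything under /etc. This is only an illustration with made-up file contents, not the real hive-site.xml.

```shell
# Work in a temporary directory so nothing under /etc is touched.
DEMO=$(mktemp -d)
mkdir "$DEMO/hive" "$DEMO/spark"
echo "version=1" > "$DEMO/hive/hive-site.xml"

cp "$DEMO/hive/hive-site.xml" "$DEMO/spark/copied.xml"      # a copy
ln -s "$DEMO/hive/hive-site.xml" "$DEMO/spark/linked.xml"   # a symlink

# Change the original, as a later Hive reconfiguration might.
echo "version=2" > "$DEMO/hive/hive-site.xml"

echo "copy: $(cat "$DEMO/spark/copied.xml")"   # still the old content
echo "link: $(cat "$DEMO/spark/linked.xml")"   # follows the original
```

The copy keeps the stale content while the link reflects the update, which is the drift the advice above is meant to avoid.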

**Spark**:

Start the Spark shell by typing:

     spark-shell 

Now enter:

    spark.catalog.listTables.show 

This checks whether the table from your Hive database is visible to Spark.

Now go to RStudio Server in your browser and run the following commands:

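  library(sparklyr)
  library(dplyr)
  # Installs a local copy of Spark; may be unnecessary when SPARK_HOME points at the cluster's Spark
  sparklyr::spark_install()
  # Config: point sparklyr at the cluster's Spark installation
  Sys.setenv(SPARK_HOME = "/usr/lib/spark")
  config <- spark_config()
  # Connect through YARN; version should match the Spark version on the cluster
  sc <- spark_connect(master = "yarn-client", config = config, version = "2.2.1")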

On the right-hand side you will now see a new tab called "Connections" next to Environment. This is your Spark cluster connection. Click on it and it should show your table from Hive.
