How to access HDFS via R?


Question

So, I am trying to connect remotely to an HDFS server via R from a Windows machine.

I use RStudio with the "rhdfs" package. Since I had to set the HADOOP_CMD environment variable, I downloaded Hadoop to my machine to provide the environment variables, and I changed core-site.xml.

Previously, I had successfully connected to the Kerberized Hive server with a keytab.

Here is my code:

# set the environment variables rhdfs/rmr2 need
Sys.setenv(HADOOP_STREAMING = "C:/Users/antonio.silva/Desktop/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar")
Sys.setenv(HADOOP_CMD = "C:/Users/antonio.silva/Desktop/hadoop-2.7.3/bin/hadoop")
Sys.setenv(HADOOP_HOME = "C:/Users/antonio.silva/Desktop/hadoop-2.7.3")
Sys.getenv("HADOOP_STREAMING")
Sys.getenv("HADOOP_CMD")
Sys.getenv("HADOOP_HOME")

# loading libraries
library(rJava)
library(rmr2)
library(rhdfs)

# initialize the JVM classpath with the Hadoop client jars
hadoop.class.path <- list.files(path = c("C:/Users/antonio.silva/Desktop/jars/hadoop/"),
                                pattern = "jar", full.names = TRUE)
.jinit(classpath = hadoop.class.path)

hdfs.init()

After calling hdfs.init() and then hdfs.defaults(), the fs variable and the working directory both point to the same (local) directory.
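For reference, this is roughly how I inspect what rhdfs points to (a sketch, assuming hdfs.defaults() with no argument returns the list of defaults rhdfs stores after hdfs.init()):

hdfs.init()
d <- hdfs.defaults()
d$fs  # the FileSystem object in use; for a remote cluster this should
      # reflect hdfs://<namenode>:<port>, not the local filesystem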

What am I doing wrong?

Answer

I figured out a solution to this.

If the server uses the Kerberos authentication method, keytab authentication can be used to access the server. See How to connect with HIVE via R with Kerberos keytab?.
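Before initializing rhdfs, the Kerberos ticket can be obtained from the keytab (a hedged sketch; the principal name and keytab path below are placeholders, not values from the original setup):

# obtain a Kerberos ticket from the keytab before touching HDFS
# (placeholder principal and keytab path; adjust to your cluster)
system('kinit -k -t "C:/Users/antonio.silva/user.keytab" user@EXAMPLE.COM')
system("klist")  # verify that a valid ticket was granted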

After that, you need to download to your machine (in this case, a Windows machine) the same version of Hadoop that is present in the cluster, and place it in a Windows directory.

Then, to configure Hadoop, follow these steps up to the "Hadoop Configuration" section: Step by step Hadoop 2.8.0 installation on Windows 10.

The Hadoop installation in the cluster contains some configuration files that will be used on your local machine: core-site.xml, yarn-site.xml, and hdfs-site.xml. They contain information about the cluster, such as the default filesystem, the type of credentials used in the cluster, and the hostname and port in use.
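For example, the default filesystem entry in core-site.xml looks like this (a sketch; the NameNode hostname and port are placeholders for your cluster's values):

<property>
  <name>fs.defaultFS</name>
  <!-- placeholder hostname and port; copy the real values from the cluster -->
  <value>hdfs://namenode.example.com:8020</value>
</property>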

Additional: to use hostnames when connecting to the DataNodes, you need to add these lines to the hdfs-site.xml file.

<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
  <description>Whether clients should use datanode hostnames when
  connecting to datanodes.</description>
</property>

Finally, in R, use the following code to perform the connection:

# set the environment variables in R
# (HADOOP_HOME must point to the Hadoop root directory; winutils.exe is
# expected under its bin/ folder on Windows)
Sys.setenv(HADOOP_HOME = "C:/Users/antonio.silva/Desktop/hadoop-2.7.3")
Sys.setenv(HADOOP_CMD = "C:/Users/antonio.silva/Desktop/hadoop-2.7.3/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "C:/Users/antonio.silva/Desktop/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar")

library(rhdfs)

hdfs.init()
hdfs.ls("/")
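Once hdfs.init() succeeds, the connection can be verified with a few basic rhdfs operations (a sketch; the paths are placeholders):

# create a directory, upload a local file, and list the result
# (placeholder paths; adjust to your user's HDFS home)
hdfs.mkdir("/user/antonio.silva/test")
hdfs.put("C:/Users/antonio.silva/Desktop/local_file.csv",
         "/user/antonio.silva/test/local_file.csv")
hdfs.ls("/user/antonio.silva/test")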

And that is all that is needed to connect to a Kerberized Hadoop cluster.

