Temporarily installing R packages on Hadoop nodes for streaming jobs


Problem Description

I have access to a Hadoop cluster that has base R (2.14.1) but no additional packages installed in every node. I've been writing base R mapper and reducer streaming scripts to get around the fact that I have no additional packages. However, I've come to a point where I need to use certain packages, rjson mainly, as part of my scripts.

I don't have admin privileges on the cluster, and the user accounts are fairly restricted. Having the cluster admins install the package on every node is not an option (for now), and the cluster has no external internet access.

I've uploaded the rjson_0.2.8.tar.gz source file to my gateway node. Is it possible to install R packages temporarily by adding install.packages("rjson_0.2.8.tar.gz", repos = NULL, lib = "/tmp") or something along those lines, such that the package is installed when the script starts, and pass the source via the -cacheArchive parameter of the streaming job? I'd like the package to be installed in a temp location such that it disappears when the job is complete.
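A minimal sketch of what such a mapper might look like, assuming the tarball ends up in the task's working directory (e.g. shipped with -file, or unpacked by -cacheArchive); tempdir() is per-process, so the temporary library vanishes when the task ends. The filename and streaming loop are illustrative, not from a tested setup:

```r
#!/usr/bin/env Rscript
# Hypothetical mapper: install rjson into a per-task temporary library at
# startup. Assumes rjson_0.2.8.tar.gz is present in the working directory.
tmplib <- tempdir()                       # removed when the task process exits
install.packages("rjson_0.2.8.tar.gz",    # install from local source tarball
                 repos = NULL, type = "source", lib = tmplib)
library(rjson, lib.loc = tmplib)

# The usual streaming loop over stdin:
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  rec <- fromJSON(line)
  # ... emit key/value pairs with cat(key, "\t", value, "\n", sep = "")
}
close(con)
```

The cost of install.packages is paid once per task rather than once per record, which is usually tolerable for source packages as small as rjson.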

Is this even possible?
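On the submission side, a hedged sketch of what the streaming invocation could look like (the jar path, HDFS paths, and script names are assumptions, not from the original post). Note that -cacheArchive unpacks archives on each node, with the name after '#' becoming a symlink to the unpacked contents; to keep the .tar.gz intact in the task's working directory, ship it with -file instead:

```shell
# Hypothetical streaming job: ship the mapper/reducer scripts and the
# source tarball to every task's working directory.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -input  /user/me/input \
  -output /user/me/output \
  -mapper mapper.R \
  -reducer reducer.R \
  -file mapper.R \
  -file reducer.R \
  -file rjson_0.2.8.tar.gz
```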

I know I'll get some "use python" answers since it's for processing JSON, which is an option, but the question is for any package. :)

Recommended Answer

I am the author of rmr (project RHadoop). We are experimenting with a pretty radical approach to sidestep the installation issue: we package the whole R distribution, packages and all, in a jar, using the streaming features as you describe but with one degree of indirection. The R distribution is loaded into a user HDFS directory, not a tmp directory. Streaming then moves it to each node, and the job itself moves it to its final destination whenever it's not already present. We did it this way because the whole distro is not tiny and we wanted to take advantage of streaming's caching features, and because components of R are not relocatable. So whenever you update something or add a package, you rebuild the jar and move it to HDFS; the rest is automatic and happens only when needed (hdfs -> nodes -> final location). I even got some coaching from the Hortonworks guys to do it right.

We have a proof of concept in the 0-install branch, but it works only for Ubuntu/EC2; apparently I managed to hard-code some paths I shouldn't have, and I'm making a number of other assumptions, so this is only for developers willing to chip in, but the main ingredients are all in place. Of course this is conditional on writing your jobs with rmr, which is a separate decision, or you could just take a look at the code and reproduce the approach for your own purposes. But I'd rather have this solved once and for all for everybody. The script that prepares the jar is here: https://github.com/RevolutionAnalytics/RHadoop/blob/0-install/rmr/pkg/tools/0-install/setup-jar and the rest of the action is in rmr:::rhstream
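The "move it to its final destination whenever it's not already present" step might look roughly like the following on each node. This is a hedged sketch only: the symlink name, the final path, and the environment variable are assumptions, and the real logic lives in the setup-jar script and rmr:::rhstream linked above:

```shell
# Hypothetical node-side setup: copy the cached R distribution to a fixed
# final location only if it is not already there, so the expensive copy
# happens at most once per node. 'Rdist' stands in for the symlink that
# streaming creates for the cached archive.
FINAL=${FINAL:-/tmp/rhadoop-R}
if [ -d Rdist ] && [ ! -d "$FINAL" ]; then
  mkdir -p "$FINAL"
  cp -r Rdist/. "$FINAL/"   # fixed path because R is not relocatable
fi
export R_HOME="$FINAL"
```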
