Should I pre-install CRAN R packages on worker nodes when using SparkR?


Problem description


I want to use R packages from CRAN, such as forecast, with SparkR, and I have run into the following two problems.

  1. Should I pre-install all of those packages on the worker nodes? When I read the Spark source code in this file (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/RPackageUtils.scala), it seems that Spark automatically zips packages and distributes them to the workers via --jars or --packages. What should I do to make the dependencies available on the workers?

  2. Suppose I need to use functions provided by forecast in a map transformation; how should I import the package? Do I need to do something like the following, importing the package inside the map function, and will that import it multiple times (see the sketch after this list)?

    SparkR:::map(rdd, function(x) {
      library(forecast)
      # then do other stuff
    })
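
To make the pattern in question 2 concrete, here is a minimal sketch against the SparkR 1.x internal (":::") RDD API; the exact internal names can differ between Spark releases, and it assumes the forecast package is already installed on every worker node:

    # Sketch only: internal SparkR 1.x RDD API, assuming 'forecast' is already
    # installed on every worker node.
    library(SparkR)

    sc  <- sparkR.init(master = "local[2]")
    rdd <- SparkR:::parallelize(sc, list(ts(rnorm(24), frequency = 12),
                                         ts(rnorm(36), frequency = 12)))

    fits <- SparkR:::map(rdd, function(x) {
      # library() is essentially a no-op once the package is attached in a
      # worker's R process, so repeated calls across tasks are cheap.
      library(forecast)
      forecast(auto.arima(x), h = 6)
    })

    results <- SparkR:::collect(fits)
    sparkR.stop()

In other words, the repeated import the question worries about only pays the package-loading cost at most once per worker R process.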

Update:

After reading more of the source code, it seems that I can use includePackage to include packages on worker nodes, according to this file. So the question now becomes: is it true that I have to pre-install the packages on the nodes manually? If that's true, what is the use case for --jars and --packages described in question 1? If that's wrong, how do I use --jars and --packages to install the packages?
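
As a hedged reading of includePackage (it is part of the same internal API, hence the ":::" below): it only records the package name so that the workers attach it before your function runs; it does not install or ship the package, so the package still has to be present on every node. A sketch, with a hypothetical cluster URL:

    # Sketch only: includePackage registers the package to be attached on the
    # workers; it does NOT install it there.
    library(SparkR)

    sc <- sparkR.init(master = "spark://master:7077")   # hypothetical URL
    SparkR:::includePackage(sc, forecast)

    rdd   <- SparkR:::parallelize(sc, list(ts(rnorm(24), frequency = 12)))
    # No library(forecast) needed inside the function once it is included above.
    preds <- SparkR:::map(rdd, function(x) forecast(auto.arima(x), h = 6))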

Solution

It is boring to repeat this, but you shouldn't use the internal RDD API in the first place. It was removed in the first official SparkR release and it is simply not suitable for general usage.

Until the new low-level API is ready (see for example SPARK-12922, SPARK-12919, SPARK-12792) I wouldn't consider Spark a platform for running plain R code. Even when that changes, adding native (Java / Scala) code with R wrappers may be a better choice.

That being said, let's start with your questions:

  1. RPackageUtils is designed to handle packages created with Spark Packages in mind. It doesn't handle standard R libraries.
  2. Yes, you need the packages to be installed on every node (a minimal pre-installation sketch follows after this list). From the includePackage docstring:

    The package is assumed to be installed on every node in the Spark cluster.
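
To make the "installed on every node" requirement concrete, here is a minimal, hedged sketch of the pre-installation step; the package list and CRAN mirror are placeholders, and it would be run once on every node (driver and workers) by whatever tool provisions the cluster, since Spark will not install CRAN packages for you:

    # Run once on EVERY node (driver and workers); package list and mirror
    # are placeholders.
    pkgs    <- c("forecast")
    missing <- setdiff(pkgs, rownames(installed.packages()))
    if (length(missing) > 0) {
      install.packages(missing, repos = "https://cloud.r-project.org")
    }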
