Using R in Apache Spark

Question

There are some options to access R libraries in Spark:

  • directly using sparkr
  • using language bindings like rpy2 or rscala
  • using a standalone service like opencpu (see the HTTP sketch after this list)
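
For the third option, OpenCPU exposes R functions over plain HTTP, so a Spark job can call it with an ordinary HTTP client. Below is a minimal Python sketch, not part of the original question; the server address assumes OpenCPU's single-user development server, and stats::rnorm is only an illustrative function:

    # Minimal sketch: calling an R function on a standalone OpenCPU server.
    # Assumption: an OpenCPU instance is already running; localhost:5656 is
    # the default address of the single-user server and may differ for you.
    import requests

    OCPU_URL = "http://localhost:5656/ocpu"

    # OpenCPU maps R functions to endpoints of the form
    # /library/{package}/R/{function}; a trailing /json returns the value directly.
    resp = requests.post(f"{OCPU_URL}/library/stats/R/rnorm/json", data={"n": 5})
    resp.raise_for_status()
    print(resp.json())  # e.g. five standard-normal draws computed by R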

It looks like SparkR is quite limited, OpenCPU requires keeping an additional service running, and bindings can have stability issues. Is there anything else specific to the Spark architecture that makes using any of these solutions difficult?

Do you have any experience with integrating R and Spark that you can share?

Answer

The main language for the project seems like an important factor.

If pyspark is a good way for you to use Spark (meaning that you are accessing Spark from Python), accessing R through rpy2 should not make much of a difference compared with using any other Python library that has a C extension.
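
As a concrete illustration of that combination, here is a minimal sketch, not code from the original answer; it assumes R and rpy2 are installed on every worker node:

    # Minimal sketch: calling R from PySpark executors through rpy2.
    from pyspark.sql import SparkSession

    def r_mean_of_partition(rows):
        # Import on the executor, where the R runtime actually runs.
        import rpy2.robjects as robjects
        values = [float(x) for x in rows]
        if not values:
            return  # empty partition: nothing to yield
        r_mean = robjects.r["mean"]  # look up R's built-in mean()
        yield float(r_mean(robjects.FloatVector(values))[0])

    spark = SparkSession.builder.appName("rpy2-sketch").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(100), numSlices=4)
    print(rdd.mapPartitions(r_mean_of_partition).collect())
    spark.stop()

Each partition is handed to R as a single vector, so the cost of crossing the Python/R boundary is paid once per partition rather than once per element.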

There exist reports of users doing so, although with occasional questions such as "How can I partition pyspark RDDs holding R functions" (http://stackoverflow.com/questions/34669751/how-can-i-partition-pyspark-rdds-holding-r-functions) or "Can I connect an external (R) process to each pyspark worker during setup" (http://stackoverflow.com/questions/34645130/can-i-connect-an-external-r-process-to-each-pyspark-worker-during-setup).

If R is your main language, helping the SparkR authors with feedback or contributions where you feel there are limitations would be the way to go.

If your main language is Scala, rscala should be your first try.

While the combination pyspark + rpy2 would seem the most "established" (as in "uses the oldest and probably most-tried codebase"), this does not necessarily mean that it is the best solution (and young packages can evolve quickly). I would first assess what the preferred language for the project is, and try the options from there.
