我怎么可以分区pyspark RDDS控股R里面的函数 [英] How can I partition pyspark RDDs holding R functions

查看：379 发布时间：2016/5/22 15:39:15 python r apache-spark pyspark rpy2

本文介绍了我怎么可以分区pyspark RDDS控股R里面的函数的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

import rpy2.robjects as robjects

dffunc = sc.parallelize([(0,robjects.r.rnorm),(1,robjects.r.runif)])
dffunc.collect()

输出

[(0, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc28618 / R:0x26abd18>), (1, <rpy2.rinterface.SexpClosure - Python:0x7f2ecfc283d8 / R:0x26aad28>)]

虽然分区版本将导致错误：

While the partitioned version results in an error:

dffuncpart = dffunc.partitionBy(2)
dffuncpart.collect()

RuntimeError: ('R cannot evaluate code before being initialized.', <built-in function unserialize>

好像这个错误是研究未加载上的分区，这是我认为的一个暗示，这不是执行的第一个导入步骤。反正是有解决这个？

It seems like this error is that R wasn't loaded on one of the partitions, which I assume implies that the first import step was not performed. Is there anyway around this?

修改1 第二个例子使我觉得有一个在pyspark或rpy2时机的错误。

EDIT 1 This second example causes me to think there's a bug in the timing of pyspark or rpy2.

dffunc = sc.parallelize([(0,robjects.r.rnorm),     (1,robjects.r.runif)]).partitionBy(2)
def loadmodel(model):
    import rpy2.robjects as robjects
    return model[1](2)
dffunc.map(loadmodel).collect()

生成初始化之前相同的错误R 1不能评估code。

Produces the same error R cannot evaluate code before being initialized.

dffuncpickle = sc.parallelize([(0,pickle.dumps(robjects.r.rnorm)),(1,pickle.dumps(robjects.r.runif))]).partitionBy(2)
def loadmodelpickle(model):
    import rpy2.robjects as robjects
    import pickle
    return pickle.loads(model[1])(2)
dffuncpickle.map(loadmodelpickle).collect()

作品一样的预期。

Works just as expected.

推荐答案

我想说，这是不是在rpy2一个错误，这是一个功能，但我会实事求是地有定居这是一个限制。

I'd like to say that "this is not a bug in rpy2, this is a feature" but I'll realistically have to settle with "this is a limitation".

正在发生的事情是，rpy2已经2 接口电平。一个是低一级（接近至R的C API），并可以通过 rpy2.rinterface ，另一种是与更花俏一个高层次的界面，更Python化，并与类的R对象从 rinterface 继承级别的人（即最后一部分是关于酸洗以下部分重要）。导入在初始化（启动）使用默认参数，如果必要的嵌入式R中的高级别接口的结果。导入低级界面 rinterface 没有这种副作用，必须明确执行的嵌入式R初始化（函数 initr ）。 rpy2设计这种方式，因为嵌入式R初始化可以有参数：输入第一个 rpy2.rinterface ，设置初始化，然后导入 rpy2.robjects 使这成为可能。

What is happening is that rpy2 has 2 interface levels. One is a low-level one (closer to R's C API) and available through rpy2.rinterface and the other one is a high-level interface with more bells and whistles, more "pythonic", and with classes for R objects inheriting from rinterface level-ones (that last part is important for the part about pickling below). Importing the high-level interface results in initializing (starting) the embedded R with default parameters if necessary. Importing the low-level interface rinterface does not have this side effect and the initialization of the embedded R must be performed explicitly (function initr). rpy2 was designed this way because the initialization of the embedded R can have parameters: importing first rpy2.rinterface, setting the initialization, then importing rpy2.robjects makes this possible.

在除了是R对象由rpy2包裹序列化（酸洗）目前只在 rinterface 级别定义（见的文档）。酸洗 robjects -level（高级别）rpy2对象是使用 rinterface -level code和在unpickle时他们，他们将保持在较低的水平，（Python的泡菜中含有的类的对象中定义，将导入该模块的模块 - 在这里 rinterface ，这并不意味着嵌入式R初始化）。之所以事情都是这样，只是说这是不够好，现在的：这是实现我不得不同时想弥合两个有些不同的语言，并通过Python的C-API学习我的方式的好方法的时间和酸洗/取储存Python对象。鉴于有哪一个可以写类似的难易程度。

In addition to that the serialization (pickling) of R objects wrapped by rpy2 is currently only defined at the rinterface level (see the documentation). Pickling robjects-level (high-level) rpy2 objects is using the rinterface-level code and when unpickling them they will remain at that lower-level (the Python pickle contains the module the class of the object is defined in and will import that module - here rinterface, which does not imply the initialization of the embedded R). The reason for things being this way are simply that it was "good enough for now": at the time this was implemented I had to simultaneously think of a good way to bridge two somewhat different languages and learn my way through Python C-API and pickling/unpickling Python objects. Given the ease with which one can write something like

import rpy2.robjects

或

import rpy2.rinterface
rpy2.rinterface.initr()

在unpickle之前，这是从来没有重新审查。 rpy2的酸洗我知道正在使用Python的多（并添加类似于code导入语句初始化一个子进程东西的用途是廉价和充足的修复）。也许这是再次看一下这个时间。提交错误报告rpy2如果案件。

before unpickling, this was never revisited. The uses of rpy2's pickling I know about are using Python's multiprocessing (and adding something similar to the import statements in the code initializing a child process was a cheap and sufficient fix). May this is the time to look at this again. File a bug report for rpy2 if the case.

编辑：这无疑是与rpy2的问题。腌制 robjects -level对象应该unpickle回 robjects -level，而不是 rinterface -level。我在rpy2跟踪器打开了一个问题（并已推在缺省的/ dev分支一个基本的补丁）。

edit: this is undoubtedly an issue with rpy2. pickled robjects-level objects should unpickle back to robjects-level, not rinterface-level. I have opened an issue in the rpy2 tracker (and already pushed a rudimentary patch in the default/dev branch).

2日编辑：的补丁开始2.7.7版本发布rpy2的一部分（在写作的时候最新的版本是2.7.8）。

2nd edit: The patch is part of released rpy2 starting with version 2.7.7 (latest release at the time of writing is 2.7.8).

这篇关于我怎么可以分区pyspark RDDS控股R里面的函数的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

我怎么可以分区pyspark RDDS控股R里面的函数 [英] How can I partition pyspark RDDs holding R functions

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

我怎么可以分区pyspark RDDS控股R里面的函数 [英] How can I partition pyspark RDDs holding R functions

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭