How to use a Scala class inside PySpark

Views: 421 · Posted: 2016/5/22 15:32:40 · Tags: python scala apache-spark pyspark spark-dataframe

Question

I've been searching for a while for a way to use a Scala class in PySpark, and I haven't found any documentation or guide on the subject.

Let's say I create a simple class in Scala that uses some libraries of apache-spark, something like:

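import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

class SimpleClass(sqlContext: SQLContext, df: DataFrame, column: String) {
  // Returns a DataFrame containing only the requested column.
  def exe(): DataFrame = {
    import sqlContext.implicits._

    df.select(col(column))
  }
}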

Is there any possible way to use this class in Pyspark?
Is it too tough?
Do I have to create a .py file?
Is there any guide that shows how to do that?

By the way, I also looked at the Spark code and felt a bit lost; I was unable to replicate its functionality for my own purpose.

Recommended answer

Yes, it is possible, although it can be far from trivial. Typically you want a Java-friendly wrapper so you don't have to deal with Scala features which cannot be easily expressed using plain Java and, as a result, don't play well with the Py4J gateway.

Assuming your class is in the package com.example and you have a Python DataFrame called df:

df = ... # Python DataFrame

You have to:

Build a jar using your favorite build tool.

Include it in the driver classpath, for example using the --driver-class-path argument for the PySpark shell / spark-submit. Depending on the exact code, you may have to pass it using --jars as well.
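If the driver is started from a plain Python process rather than through the PySpark shell or spark-submit, the same flags can usually be supplied through the PYSPARK_SUBMIT_ARGS environment variable. The following is only a minimal sketch under that assumption: the helper name build_submit_args and the jar path are hypothetical, and the variable has to be set before the SparkContext is created.

```python
import os


def build_submit_args(jar_path):
    """Build a PYSPARK_SUBMIT_ARGS value that puts one jar on the classpath.

    jar_path is a hypothetical location of the jar built in the previous step.
    Paths containing spaces are not handled by this sketch.
    """
    flags = [
        "--driver-class-path", jar_path,  # make the class visible to the driver JVM
        "--jars", jar_path,               # also pass the jar via --jars, if needed
        "pyspark-shell",                  # trailing token PySpark's launcher expects
    ]
    return " ".join(flags)


# Hypothetical path; set this before the SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args("/path/to/simple-class.jar")
```

Whether --jars is needed in addition to --driver-class-path depends on the code, as noted above; drop that pair of flags if it is not.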

Extract JVM instance from a Python SparkContext instance:

jvm = sc._jvm

Extract the Scala SQLContext from the Python SQLContext instance:
```
ssqlContext = sqlContext._ssql_ctx
```

Extract the Java DataFrame from df:
```
jdf = df._jdf
```

Create a new instance of SimpleClass:

simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v")

Call the exe method and wrap the result using a Python DataFrame:

from pyspark.sql import DataFrame

DataFrame(simpleObject.exe(), sqlContext)  # the Python wrapper takes the Python SQLContext, not its JVM counterpart

The result should be a valid PySpark DataFrame. You can of course combine all the steps into a single call.
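As a rough illustration, the steps could be folded into one helper along these lines. This is a sketch, not a definitive implementation: the function name run_simple_class and its wrap parameter are hypothetical, and it passes the Python-side sql_context to the DataFrame wrapper on the assumption that the constructor expects the Python context rather than its JVM counterpart.

```python
def run_simple_class(sc, sql_context, df, column, wrap=None):
    """Run com.example.SimpleClass on the JVM and return a Python DataFrame.

    wrap is a hypothetical hook for the Python-side wrapper; it defaults to
    pyspark.sql.DataFrame.
    """
    if wrap is None:
        # Imported lazily so the helper can be defined without PySpark installed.
        from pyspark.sql import DataFrame as wrap

    jvm = sc._jvm                      # Py4J view of the driver JVM
    ssql_ctx = sql_context._ssql_ctx   # Scala SQLContext behind the Python one
    jdf = df._jdf                      # Java DataFrame behind the Python one

    simple_object = jvm.com.example.SimpleClass(ssql_ctx, jdf, column)
    return wrap(simple_object.exe(), sql_context)
```

With the objects from the steps above it would be called as run_simple_class(sc, sqlContext, df, "v"). It relies on the underscore-prefixed attributes, which are PySpark internals and may change between versions.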

Important: This approach works only if the Python code is executed solely on the driver. It cannot be used inside a Python action or transformation. See "How to use Java/Scala function from an action or a transformation?" for details.
