How to use a Scala class inside Pyspark


Problem description

I've been searching for a while to see if there is any way to use a Scala class in Pyspark, and I haven't found any documentation or guide on this subject.

Let's say I create a simple class in Scala that uses some Apache Spark libraries, something like:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

class SimpleClass(sqlContext: SQLContext, df: DataFrame, column: String) {
  // Returns a DataFrame containing only the requested column
  def exe(): DataFrame = {
    import sqlContext.implicits._

    df.select(col(column))
  }
}

  • Is there any possible way to use this class in Pyspark?
  • Is it too tough?
  • Do I have to create a .py file?
  • Is there any guide that shows how to do that?
By the way, I also looked at the Spark code and I felt a bit lost, and I was incapable of replicating their functionality for my own purpose.

Answer

Yes, it is possible, although it can be far from trivial. Typically you want a Java (friendly) wrapper so you don't have to deal with Scala features which cannot be easily expressed using plain Java and, as a result, don't play well with the Py4J gateway.

Assuming your class is in the package com.example and you have a Python DataFrame called df

      df = ... # Python DataFrame
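
For a self-contained walk-through, a toy DataFrame with a column named "v" (the column name used in the SimpleClass call further down) could be created along these lines; the data here is purely illustrative:

      df = sqlContext.createDataFrame([(1.0,), (2.0,), (3.0,)], ["v"])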
      

You have to:

1. Build a jar using your favorite build tool.
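   For example, with sbt, running sbt package in the project directory is typically enough to produce such a jar (assuming a standard sbt project layout).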

2. Include it in the driver classpath, for example using the --driver-class-path argument for the PySpark shell / spark-submit. Depending on the exact code you may have to pass it using --jars as well.
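   With the PySpark shell that could look something like pyspark --driver-class-path /path/to/simpleclass.jar --jars /path/to/simpleclass.jar, where the jar path is whatever your build produced (the name here is made up).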

3. Extract the JVM instance from the Python SparkContext instance:

      jvm = sc._jvm
      

4. Extract the Scala SQLContext from the SQLContext instance:

      ssqlContext = sqlContext._ssql_ctx
      

5. Extract the Java DataFrame from the df:

      jdf = df._jdf
      

6. Create a new instance of SimpleClass:

      simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v")
      

7. Call the exe method and wrap the result using a Python DataFrame:

      from pyspark.sql import DataFrame
      
      DataFrame(simpleObject.exe(), ssqlContext)
      

The result should be a valid PySpark DataFrame. You can of course combine all the steps into a single call.
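
As a sketch of what such a single call could look like, the steps above can be wrapped in a small helper; the function name call_simple_class is made up for illustration and assumes sc, sqlContext and df are defined as above:

      from pyspark.sql import DataFrame

      def call_simple_class(sc, sqlContext, df, column):
          # JVM gateway behind the Python SparkContext
          jvm = sc._jvm
          # Scala SQLContext and Java DataFrame behind the Python wrappers
          ssqlContext = sqlContext._ssql_ctx
          jdf = df._jdf
          # Instantiate the JVM-side class and wrap its result back into PySpark
          simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, column)
          return DataFrame(simpleObject.exe(), ssqlContext)

      result = call_simple_class(sc, sqlContext, df, "v")
      result.show()  # should behave like a regular PySpark DataFrame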

Important: This approach is possible only if the Python code is executed solely on the driver. It cannot be used inside a Python action or transformation. See How to use Java/Scala function from an action or a transformation? for details.
