Workaround for importing spark implicits everywhere


Question

I'm new to Spark 2.0 and using datasets in our code base. I'm kinda noticing that I need to import spark.implicits._ everywhere in our code. For example:

File A
class A {
    def job(spark: SparkSession) = {
        import spark.implicits._
        //create dataset ds
        val b = new B(spark)
        b.doSomething(ds)
        doSomething(ds, spark)
    }
    private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
        import spark.implicits._
        ds.map(e => 1)            
    }
}

File B
class B(spark: SparkSession) {
    def doSomething(ds: Dataset[Foo]) = {
        import spark.implicits._
        ds.map(e => "SomeString")
    }
}

What I wanted to ask is if there's a cleaner way to be able to do

ds.map(e => "SomeString")

without importing implicits in every function where I do the map? If I don't import it, I get the following error:

Error:(53, 13) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.

Answer

Something that would help a bit would be to do the import inside the class or object instead of each function. For your "File A" and "File B" examples:

File A
class A {
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    def job() = {
        //create dataset ds
        val b = new B(spark)
        b.doSomething(ds)
        doSomething(ds)
    }

    private def doSomething(ds: Dataset[Foo]) = {
        ds.map(e => 1)            
    }
}

File B
class B(spark: SparkSession) {
    import spark.implicits._

    def doSomething(ds: Dataset[Foo]) = {    
        ds.map(e => "SomeString")
    }
}

In this way, you get a manageable amount of imports.

Unfortunately, to my knowledge there is no other way to reduce the number of imports even further. This is because the SparkSession object is needed when doing the actual import. Hence, this is the best that can be done.
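
To make that concrete, here is a minimal sketch (the ImplicitsDemo object and its run method are made up for illustration): implicits is an object defined inside the SparkSession instance, so import spark.implicits._ is a path-dependent import that only compiles once a concrete spark value is in scope.

import org.apache.spark.sql.{Dataset, SparkSession}

object ImplicitsDemo {
    def run(): Unit = {
        // `implicits` lives on the SparkSession *instance*, so the import
        // below is path-dependent and needs the `spark` value to exist first;
        // there is no session-independent import to put at the top of a file.
        val spark: SparkSession = SparkSession.builder.getOrCreate()
        import spark.implicits._

        // With the implicits in scope, Encoders for primitives and case
        // classes are resolved automatically.
        val ds: Dataset[Int] = Seq(1, 2, 3).toDS()
        ds.map(_ + 1).show()
    }
}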

Update:

An even more convenient method is to create a Scala Trait and combine it with an empty Object. This allows for easy import of the implicits at the top of each file, while still allowing the trait to be extended in order to use the SparkSession object.

Example:

trait SparkJob {
  val spark: SparkSession = SparkSession.builder
    .master(...)
    .config(..., ....) // Any settings to be applied
    .getOrCreate()
}

object SparkJob extends SparkJob {}

With this we can do the following for File A and B:

File A:

import SparkJob.spark.implicits._
class A extends SparkJob {
  spark.sql(...) // Allows for usage of the SparkSession inside the class
  ...
}

File B:

import SparkJob.spark.implicits._
class B extends SparkJob {
  ...    
}

Note that it's only necessary to extend SparkJob for the classes or objects that use the spark object itself.
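
As a small illustration of that note (class C is hypothetical, and Foo is the question's case class), a file that only uses the implicits for Dataset operations, and never touches spark directly, can get by with the import alone and does not have to extend SparkJob:

import SparkJob.spark.implicits._
import org.apache.spark.sql.Dataset

// Hypothetical class: it only relies on the imported implicits (the Encoder
// behind map), never on the spark value itself, so it does not extend SparkJob.
class C {
    def doSomething(ds: Dataset[Foo]): Dataset[String] =
        ds.map(e => "SomeString")
}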
