Interoperability: sharing Datasets of objects or Row between Java and Scala, two ways. I put a Scala dataset operation in the middle of Java ones


Problem description

Currently, my main application is built with Java Spring-boot, and this won't change because it's convenient.
@Autowired service beans implement, for example:

  • Enterprise and establishment datasets. The first one is also able to return a list of Enterprise objects that carry a Map of their establishments.
    So the service returns: Dataset<Enterprise>, Dataset<Establishment>, Dataset<Row>
  • Associations: Dataset<Row>
  • Cities: Dataset<Commune> or Dataset<Row>
  • Local authorities: Dataset<Row>.

Many use-case functions are calls of this kind:

What are associations(year=2020)?

And my application forwards to datasetAssociation(2020), which operates with the enterprise and establishment datasets, and with the city and local-authority ones, to provide a useful result.

For this, I'm considering an operation that involves other operations between datasets:

  • some made of Rows,
  • some carrying concrete objects.

In terms of datasets reached/involved, I have this operation to do:
associations.enterprises.establishments.cities.localautorities
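
A minimal Scala sketch of that chain as successive joins; the join keys (siren, siret, cityCode) are illustrative assumptions, not the real column names:

    import org.apache.spark.sql.{Dataset, Row}

    // sketch only: the join keys are assumptions, not the actual schema
    def associationReport(
        associations: Dataset[Row], enterprises: Dataset[Row],
        establishments: Dataset[Row], cities: Dataset[Row],
        localAuthorities: Dataset[Row]): Dataset[Row] =
      associations
        .join(enterprises, "siren")      // enterprise identifier
        .join(establishments, "siret")   // establishment identifier
        .join(cities, "cityCode")
        .join(localAuthorities, "cityCode")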

  1. A Dataset<Row> built with Java code is sent to a Scala function to be completed.

Scala creates a new dataset with Enterprise and Establishment objects.
a) If the source of an object is written in Scala, I don't have to recreate a new source for it in Java.
b) Conversely, if the source of an object is written in Java, I don't have to recreate a new source in Scala.
c) I can use a Scala object returned by this dataset directly on the Java side.
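
For example, a minimal sketch of point 1, assuming a hypothetical Java bean Enterprise (no-arg constructor, getters/setters) and hypothetical column names. A plain Scala object gets static forwarders, so Java can call DatasetCompleter.complete(rows) directly:

    import org.apache.spark.sql.{Dataset, Encoders, Row}

    // DatasetCompleter, complete and the column names are illustrative only
    object DatasetCompleter {
      def complete(rows: Dataset[Row]): Dataset[Enterprise] = {
        // keep the columns matching the bean's fields, then type the dataset
        rows.select("siren", "name")
            .as(Encoders.bean(classOf[Enterprise]))
      }
    }

From Java, the call is then Dataset<Enterprise> ds = DatasetCompleter.complete(rows);, since both APIs share the same Dataset class.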

Scala will have to call functions that remain implemented in Java, and send them the underlying dataset it is creating (for example, to complete it with city information).

Java calls Scala methods at any time,
and Scala calls Java methods at any time too:

an operation could follow a
Java -> Scala -> Scala -> Java -> Scala -> Java -> Java
path if wished, in terms of the native language of each method called,
because I don't know in advance which parts I will find useful to port to Scala.

Once these three points are completed, I will consider that Java and Scala are interoperable both ways, each benefiting from the other.

But can I achieve this goal (in Spark 2.4.x, or more probably in Spark 3.0.0), such that:

  • it doesn't make the source code on one side or the other too clumsy, or even worse: duplicated;
  • it doesn't degrade performance badly (for example, having to recreate a whole dataset, or to convert each object it contains, on either side would be prohibitive).

Recommended answer

As Jasper-M wrote, Scala and Java code are perfectly interoperable:

  • they are both compiled into .class files that the JVM executes in exactly the same way
  • the Spark Java and Scala APIs can be used together, with a few specifics:
    • both use the same Dataset class, so there is no issue there
    • however, SparkContext and RDD (and all RDD variants) have Scala APIs that are not practical to use from Java, mainly because the Scala methods take Scala types as input, which are not the ones you use in Java. They all have Java wrappers though (JavaSparkContext, JavaRDD). Coding in Java, you have probably seen those wrappers already (see the sketch below).
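
    As a minimal Scala sketch of how the two views of an RDD relate: the wrapper simply delegates to the underlying Scala RDD, so converting back and forth is cheap.

      import org.apache.spark.api.java.JavaRDD
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("interop").getOrCreate()

      val scalaRdd = spark.sparkContext.parallelize(Seq(1, 2, 3))
      val javaRdd: JavaRDD[Int] = scalaRdd.toJavaRDD() // the view you would hand to Java code
      val backToScala = javaRdd.rdd                    // unwrap it again on the Scala side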

    Now, as many have recommended, Spark being a Scala library first, and the Scala language being more powerful than Java (*), using Scala to write Spark code will be much easier. Also, you will find many more code examples in Scala. It is often difficult to find Java code examples for complex Dataset manipulations.

    So, I think the two main issues you should take care of are:

    1. (not Spark related, but necessary) Have a project that compiles both languages and allows two-way interoperability. I think sbt provides this out of the box, and with Maven you need to use the scala plugin and (from my experience) put both the Java and the Scala files in the java folder. Otherwise one side can call the other, but not the opposite (Scala can call Java but Java cannot call Scala, or the other way around).
    2. You should be careful about the encoders that are used each time you create a typed Dataset (i.e. Dataset[YourClass] and not Dataset<Row>). In Java, and for Java model classes, you need to use Encoders.bean(YourClass.class) explicitly. But in Scala, by default Spark finds the encoder implicitly, and encoders are built in for Scala case classes ("product types") and Scala standard collections. So just be mindful of which encoders are used. For example, if you create a Dataset of YourJavaClass in Scala, I think you will probably have to give Encoders.bean(YourJavaClass.class) explicitly for it to work and not have serialization issues, as in the sketch below.
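
    A minimal sketch of point 2 in Scala, assuming a hypothetical Java bean YourJavaClass (no-arg constructor, getters/setters) and a hypothetical input file:

      import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._ // implicit encoders for Scala types

      // Scala case class: the encoder is derived implicitly via spark.implicits._
      case class Commune(code: String, name: String)
      val communes: Dataset[Commune] = Seq(Commune("75056", "Paris")).toDS()

      // Java bean: the encoder must be passed explicitly, exactly as in Java
      val beans: Dataset[YourJavaClass] =
        spark.read.json("beans.json").as(Encoders.bean(classOf[YourJavaClass]))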

    One last note: you wrote that you use Java Spring-boot. So:

    • Be aware that Spring's design goes completely against Scala/functional recommended practice, using null and mutable state all over. You can still use Spring, but it might feel strange in Scala, and the community will probably not accept it easily.
    • You can call Spark code from a Spring context, but you should not use Spring (contexts) from Spark, especially inside methods distributed by Spark, such as in rdd.map. That would attempt to create a Spring context in each worker, which is very slow and can easily fail. See the sketch below.
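
    Here is a minimal Scala sketch of the safe pattern, with a hypothetical stand-in for a Spring bean; the distributed closure captures only plain data extracted on the driver:

      import org.apache.spark.sql.SparkSession

      trait CityService { def referenceYear(): Int } // stand-in for a Spring bean

      def enrichAll(spark: SparkSession, cityService: CityService): Unit = {
        val year = cityService.referenceYear() // read the bean on the driver only
        val rdd = spark.sparkContext.parallelize(Seq("a", "b"))
        // the closure captures the plain Int, not the bean, so no Spring
        // context needs to be created on the workers
        rdd.map(s => s + "-" + year).collect().foreach(println)
      }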

    (*) About "Scala being more powerful than Java": I don't mean that Scala is better than Java (well, I do think so, but it is a matter of taste :). What I mean is that the Scala language provides much more expressiveness than Java; basically, it does more with less code. The main differences are:

    • implicits, which the Spark API uses heavily
    • monads + for-comprehensions
    • and of course the powerful type system (for example, types are covariant: a List[Dog] is a subclass of List[Animal] in Scala, but not in Java); see the small illustration below
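
    A small illustration of the last two points (not Spark-specific):

      // covariance: List[+A] lets a List[Dog] be used where a List[Animal] is expected
      class Animal
      class Dog extends Animal

      val dogs: List[Dog] = List(new Dog)
      val animals: List[Animal] = dogs // compiles in Scala; the Java analogue does not

      // monads + for-comprehension: flatMap/map chains read like a query
      val sum: Option[Int] = for {
        x <- Option(1)
        y <- Option(2)
      } yield x + y // Some(3)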
