Does Spark data set method serialize the computation itself?

Problem description

I have a data set with multiple columns. A function needs to be invoked to compute a result using the data available within a row, so I used a case class with methods and created a Dataset from it. As an example,

case class testCase(x: Double, a1: Array[Double], a2: Array[Double]) {
    var someInt = 0
    def myMethod1(): Unit = { ... }    // uses x, a1 and a2
    def myMethod2(): Unit = { ... }    // uses x, a1 and a2
    def result(): Int = someInt
}
It is invoked from main() as

val res = myDS.map(_.result()).toDF("result")

The problem I am facing is that, while the code works correctly, the above statement does not run concurrently, no matter how I invoke it, unlike the other parts of the program. Irrespective of the number of executors, cores, and repartitioning, only one instance of the method seems to run at a time!

Any hints as to what I should look at would be appreciated.

Recommended answer

The testCase case class should not be mutable; if you concurrently modify the state of the object, your program will be non-deterministic. With the little information you have given, what looks wrong is this snippet:

var someInt = 0

That value is probably being modified concurrently by several tasks, and I am pretty sure you don't want that.
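
A minimal sketch of an immutable alternative, assuming the per-row computation can be expressed as a pure function of x, a1, and a2 (the placeholder arithmetic and the Demo wrapper are illustrative, not from the original question):

import org.apache.spark.sql.SparkSession

// Hypothetical immutable rewrite: result() is a pure function of the
// row's fields, so tasks share no mutable state. The arithmetic below
// is a placeholder for the real logic in myMethod1/myMethod2.
case class testCase(x: Double, a1: Array[Double], a2: Array[Double]) {
  def result(): Int = (x + a1.sum + a2.sum).toInt
}

object Demo extends App {
  val spark = SparkSession.builder()
    .appName("immutable-case-class-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val myDS = Seq(
    testCase(1.0, Array(1.0, 2.0), Array(3.0)),
    testCase(2.0, Array(4.0), Array(5.0, 6.0))
  ).toDS()

  // Each map task computes its own value; nothing is mutated, so the
  // computation parallelizes safely across partitions.
  val res = myDS.map(_.result()).toDF("result")
  res.show()
}

With no var to synchronize on, each row's result depends only on that row, which is exactly the shape Spark expects for a map over a Dataset.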

Can you explain what you are trying to do?
