Should I leave the variable as transient?


Problem description

I have been experimenting with Apache Spark, trying to solve queries such as top-k and skyline.

I have made a wrapper named SparkContext which encloses SparkConf and JavaSparkContext. The class also implements Serializable, but since SparkConf and JavaSparkContext are not serializable, the class effectively isn't either.

I have a class named TopK that solves the top-k query. The class implements Serializable, but it also has a SparkContext member variable, which is not serializable (for the reason above). Therefore I get an exception whenever I try to execute a TopK method from within a .reduce() function on an RDD.

The solution I have found is to make the SparkContext field transient.

My question is: should I keep the SparkContext variable transient, or am I making a big mistake?

The SparkContext class:

import java.io.Serializable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;

public class SparkContext implements Serializable {

    private final SparkConf sparConf; // this is not serializable
    private final JavaSparkContext sparkContext; // this is not either

    protected SparkContext(String appName, String master) {
        this.sparConf = new SparkConf();
        this.sparConf.setAppName(appName);
        this.sparConf.setMaster(master);

        this.sparkContext = new JavaSparkContext(sparConf);
    }

    protected JavaRDD<String> textFile(String path) {
        return sparkContext.textFile(path);
    }

}

The TopK class:

public class TopK implements QueryCalculator, Serializable {

    private final transient SparkContext sparkContext;
    .
    .
    .
}

An example that throws a Task not serializable exception: getBiggestPointByXDimension is never even entered, because for it to be executed inside a reduce function, the class enclosing it (TopK) must be serializable.

private Point findMedianPoint(JavaRDD<Point> points) {
    Point biggestPointByXDimension = points.reduce((a, b) -> getBiggestPointByXDimension(a, b));
    .
    .
    .
}

private Point getBiggestPointByXDimension(Point first, Point second) {
    return first.getX() > second.getX() ? first : second;
}

Answer

To your question: should you keep the SparkContext variable transient?

Yes. That's fine. It only encapsulates the (Java)SparkContext, and the context is not usable on the workers anyway, so marking it transient just tells the serializer not to serialize that field.
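The effect of `transient` can be seen with plain JDK serialization, no Spark required. The sketch below uses hypothetical names (NotSerializableThing stands in for a member like JavaSparkContext): the transient field is simply skipped when the object is written, so serialization succeeds and the field comes back as null after a round trip.

```java
import java.io.*;

public class TransientDemo {

    // Hypothetical stand-in for a non-serializable member such as JavaSparkContext.
    static class NotSerializableThing {
        final String name;
        NotSerializableThing(String name) { this.name = name; }
    }

    static class Holder implements Serializable {
        final String id;                          // serialized normally
        final transient NotSerializableThing ctx; // skipped by the serializer

        Holder(String id, NotSerializableThing ctx) {
            this.id = id;
            this.ctx = ctx;
        }
    }

    // Round-trips an object through Java serialization.
    static Holder roundTrip(Holder h) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(h); // succeeds: the transient field is simply not written
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Holder) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Holder after = roundTrip(new Holder("query-1", new NotSerializableThing("ctx")));
        System.out.println(after.id);          // query-1
        System.out.println(after.ctx == null); // true
    }
}
```

On a worker that null would be exactly what you get, which is fine here because the context is only ever used on the driver.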

You could also leave your own SparkContext wrapper non-serializable and mark the field as transient, with the same effect as above. (BTW, given that SparkContext is the Scala class name for the Spark context, I'd choose another name to avoid confusion down the road.)

One more thing: as you pointed out, the reason Spark tries to serialize the complete enclosing class is that a method of that class is being used within a closure. Avoid that! Use an anonymous class or a self-contained closure (which is translated into an anonymous class in the end).
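The capture problem can also be reproduced with the plain JDK. In this sketch (hypothetical names, no Spark), SerOp plays the role of the serializable function type Spark expects in reduce(): a lambda that calls an instance method captures the enclosing object and fails to serialize, while a lambda that only calls a static method serializes fine.

```java
import java.io.*;
import java.util.function.BinaryOperator;

public class ClosureDemo {

    // Serializable flavor of BinaryOperator, analogous to what Spark
    // requires of functions passed to reduce().
    interface SerOp extends BinaryOperator<Integer>, Serializable {}

    private final int bias = 0; // instance state: using it ties pickMax to `this`

    // Instance method: a lambda invoking it captures the enclosing
    // ClosureDemo, which is not Serializable.
    int pickMax(int a, int b) { return Math.max(a, b) + bias; }

    // Static, self-contained version: a lambda invoking it captures nothing.
    static int pickMaxStatic(int a, int b) { return Math.max(a, b); }

    static SerOp capturingOp() {
        ClosureDemo outer = new ClosureDemo();
        return (a, b) -> outer.pickMax(a, b); // drags `outer` into the closure
    }

    static SerOp selfContainedOp() {
        return (a, b) -> pickMaxStatic(a, b); // captures nothing
    }

    // Tries to serialize the function the way Spark would when shipping a task.
    static boolean serializes(SerOp op) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(op);
            return true;
        } catch (IOException e) {
            return false; // NotSerializableException: the captured object isn't serializable
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(capturingOp()));     // false
        System.out.println(serializes(selfContainedOp())); // true
    }
}
```

Applied to the question's code, making getBiggestPointByXDimension static, or inlining the comparison into the lambda body, would likewise stop the reduce lambda from capturing the TopK instance.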

