Task not Serializable - Spark Java


Problem description

I'm getting the Task not serializable error in Spark. I've searched and tried to use a static function as suggested in some posts, but it still gives the same error.

The code is as follows:

import java.io.Serializable;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.SparkSession;

public class Rating implements Serializable {
    private SparkSession spark;
    private SparkConf sparkConf;
    private JavaSparkContext jsc;
    private static Function<String, Rating> mapFunc;

    public Rating() {
        // Anonymous inner class, created in an instance context (the constructor)
        mapFunc = new Function<String, Rating>() {
            public Rating call(String str) {
                // parseRating is referenced here, but its definition was not included in the post
                return Rating.parseRating(str);
            }
        };
    }

    public void runProcedure() {
        sparkConf = new SparkConf().setAppName("Filter Example").setMaster("local");
        jsc = new JavaSparkContext(sparkConf);
        SparkSession spark = SparkSession.builder().master("local").appName("Word Count")
            .config("spark.some.config.option", "some-value").getOrCreate();

        // This map is where the "Task not serializable" error is raised
        JavaRDD<Rating> ratingsRDD = spark.read().textFile("sample_movielens_ratings.txt")
                .javaRDD()
                .map(mapFunc);
    }

    public static void main(String[] args) {
        Rating newRating = new Rating();
        newRating.runProcedure();
    }
}

The error given is the org.apache.spark.SparkException: Task not serializable exception.

How do I solve this error? Thanks in advance.

Recommended answer

Clearly Rating cannot be Serializable, because it contains references to Spark structures (SparkSession, SparkConf, etc.) as attributes.

The problem is here:

JavaRDD<Rating> ratingsRDD = spark.read().textFile("sample_movielens_ratings.txt")
            .javaRDD()
            .map(mapFunc);

If you look at the definition of mapFunc, you're returning a Rating object.

mapFunc = new Function<String, Rating>() {
    public Rating call(String str) {
        return Rating.parseRating(str);
    }
};

This function is used inside a map (a transformation in Spark terms). Because transformations are executed on the worker nodes rather than on the driver node, their code must be serializable. Moreover, since mapFunc is an anonymous inner class created inside Rating's constructor, it carries a hidden reference to the enclosing Rating instance; serializing the function therefore forces Spark to try to serialize the Rating class as well, which is not possible.
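
As a side note (this is not part of the original answer): since it is that hidden outer-instance reference that drags Rating into the closure, a common workaround is to define the function as a static nested class, which carries no reference to the enclosing instance. A minimal sketch, with the hypothetical name ParseRatingFunction:

// Inside Rating: a static nested class holds no implicit reference to the
// enclosing Rating instance, so only this small, stateless object is
// serialized and shipped to the workers with the task.
private static class ParseRatingFunction implements Function<String, Rating> {
    @Override
    public Rating call(String str) {
        return Rating.parseRating(str);
    }
}

// ... and in runProcedure():
JavaRDD<Rating> ratingsRDD = spark.read().textFile("sample_movielens_ratings.txt")
        .javaRDD()
        .map(new ParseRatingFunction());

Note that this only fixes the closure capture: if the Rating elements themselves are later shuffled, collected, or cached with serialization, their Spark-structure fields would still fail, which is why the refactor suggested below is the cleaner fix.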

Try to extract the features you need from Rating and place them in a different class that does not own any Spark structure. Finally, use this new class as the return type of your mapFunc function.
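
For illustration, here is a minimal sketch of that refactor. The class name ParsedRating, its fields, and the "userId::movieId::rating::timestamp" line layout are assumptions (the original parseRating was not shown); the point is that the type returned by the transformation carries plain data and no Spark handles:

import java.io.Serializable;

// Plain data holder: no SparkSession, SparkConf or JavaSparkContext fields,
// so it is trivially safe to serialize.
public class ParsedRating implements Serializable {
    private final int userId;
    private final int movieId;
    private final float rating;

    public ParsedRating(int userId, int movieId, float rating) {
        this.userId = userId;
        this.movieId = movieId;
        this.rating = rating;
    }

    // Hypothetical parser, assuming the MovieLens "userId::movieId::rating::timestamp" layout
    public static ParsedRating parse(String line) {
        String[] fields = line.split("::");
        return new ParsedRating(Integer.parseInt(fields[0]),
                                Integer.parseInt(fields[1]),
                                Float.parseFloat(fields[2]));
    }
}

The driver can then keep its Spark structures, because the function shipped to the workers no longer touches them:

// A method reference works here because Spark's Java Function interface
// extends Serializable and has a single abstract method.
JavaRDD<ParsedRating> ratingsRDD = spark.read().textFile("sample_movielens_ratings.txt")
        .javaRDD()
        .map(ParsedRating::parse);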
