Why does SparkSession execute twice for one action?


Question

Recently upgraded to Spark 2.0 and I'm seeing some strange behavior when trying to create a simple Dataset from JSON strings. Here's a simple test case:

SparkSession spark = SparkSession.builder().appName("test").master("local[1]").getOrCreate();
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

JavaRDD<String> rdd = sc.parallelize(Arrays.asList(
        "{\"name\":\"tom\",\"title\":\"engineer\",\"roles\":[\"designer\",\"developer\"]}",
        "{\"name\":\"jack\",\"title\":\"cto\",\"roles\":[\"designer\",\"manager\"]}"
));

// Log each element as it passes through the map, to count how often it runs.
JavaRDD<String> mappedRdd = rdd.map(json -> {
    System.out.println("mapping json: " + json);
    return json;
});

Dataset<Row> data = spark.read().json(mappedRdd);
data.show();

And the output:

mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
+----+--------------------+--------+
|name|               roles|   title|
+----+--------------------+--------+
| tom|[designer, develo...|engineer|
|jack| [designer, manager]|     cto|
+----+--------------------+--------+

It seems that the "map" function is being executed twice even though I'm only performing one action. I thought that Spark would lazily build an execution plan, then execute it when needed, but this makes it seem that in order to read data as JSON and do anything with it, the plan will have to be executed at least twice.

In this simple case it doesn't matter, but when the map function is long running, this becomes a big problem. Is this right, or am I missing something?

Recommended answer

This happens because you don't provide a schema for the DataFrameReader. As a result, Spark has to eagerly scan the data set to infer the output schema.

Since mappedRdd is not cached, it will be evaluated twice:

  • once for schema inference
  • once when you call data.show
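
As an aside that goes beyond the original answer: if the map is expensive and a schema isn't available up front, persisting mappedRdd keeps the map function body from running twice, even though Spark still makes two passes over the data. A minimal sketch in Java, reusing the rdd from the question:

// Cache the mapped RDD so the map function body executes only once;
// the second pass (triggered by show()) reads the cached elements.
JavaRDD<String> mappedRdd = rdd.map(json -> {
    System.out.println("mapping json: " + json);
    return json;
}).cache();

Dataset<Row> data = spark.read().json(mappedRdd);
data.show(); // "mapping json: ..." now prints once per record

Caching trades memory for the recomputation; supplying a schema, as below, avoids the inference pass entirely.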

If you want to prevent this, you should provide a schema for the reader (Scala syntax):

val schema: org.apache.spark.sql.types.StructType = ??? // fill in the known schema
spark.read.schema(schema).json(mappedRdd)
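
Because the question's test case is Java, here is the same fix sketched with the Java API; the field names and types below are assumptions read off the sample JSON, not something the original answer spells out:

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Declaring the schema up front lets Spark skip the eager inference scan.
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("name", DataTypes.StringType, true),
        DataTypes.createStructField("roles",
                DataTypes.createArrayType(DataTypes.StringType), true),
        DataTypes.createStructField("title", DataTypes.StringType, true)));

Dataset<Row> data = spark.read().schema(schema).json(mappedRdd);
data.show();

With the schema supplied, the only job is the one triggered by show(), so the map function runs once per record.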
