Why does SparkSession execute twice for one action?


Problem Description

I recently upgraded to Spark 2.0 and I'm seeing some strange behavior when trying to create a simple Dataset from JSON strings. Here's a simple test case:

 import java.util.Arrays;

 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;

 SparkSession spark = SparkSession.builder().appName("test").master("local[1]").getOrCreate();
 JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

 JavaRDD<String> rdd = sc.parallelize(Arrays.asList(
         "{\"name\":\"tom\",\"title\":\"engineer\",\"roles\":[\"designer\",\"developer\"]}",
         "{\"name\":\"jack\",\"title\":\"cto\",\"roles\":[\"designer\",\"manager\"]}"
 ));

 JavaRDD<String> mappedRdd = rdd.map(json -> {
     System.out.println("mapping json: " + json);
     return json;
 });

 Dataset<Row> data = spark.read().json(mappedRdd);
 data.show();

Output:

mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
mapping json: {"name":"tom","title":"engineer","roles":["designer","developer"]}
mapping json: {"name":"jack","title":"cto","roles":["designer","manager"]}
+----+--------------------+--------+
|name|               roles|   title|
+----+--------------------+--------+
| tom|[designer, develo...|engineer|
|jack| [designer, manager]|     cto|
+----+--------------------+--------+

It seems that the "map" function is being executed twice even though I'm only performing one action. I thought that Spark would lazily build an execution plan, then execute it when needed, but this makes it seem that in order to read data as JSON and do anything with it, the plan will have to be executed at least twice.

In this simple case it doesn't matter, but when the map function is long running, this becomes a big problem. Is this right, or am I missing something?

Recommended Answer

This happens because you don't provide a schema for the DataFrameReader. As a result, Spark has to eagerly scan the data set to infer the output schema.

Since mappedRdd is not cached, it will be evaluated twice:

  • once for schema inference
  • once when you call data.show
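As a side note, one workaround for the recomputation (my own addition, not part of the original answer) is to cache mappedRdd, so that the schema-inference job materializes the partitions and data.show() reads them back from memory instead of re-running the map. A minimal sketch, reusing the names from the question:

// Mark the mapped RDD for caching: the schema-inference job materializes
// it, and data.show() then reuses the cached partitions instead of
// re-running the map function.
mappedRdd.cache();

Dataset<Row> data = spark.read().json(mappedRdd);
data.show(); // "mapping json: ..." now prints once per record

This trades memory for compute: Spark still runs two jobs, but the expensive map executes only once. The cleaner fix, as the answer notes, is to skip inference entirely: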

If you want to prevent this, you should provide a schema for the reader (Scala syntax):

val schema: org.apache.spark.sql.types.StructType = ???
spark.read.schema(schema).json(mappedRdd)
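Since the question is in Java, here is a sketch of the same fix using Spark's Java schema API; the field names and types are assumptions read off the sample records in the question:

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Explicit schema matching the sample JSON: name, roles, title.
StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("name", DataTypes.StringType, true),
    DataTypes.createStructField("roles",
            DataTypes.createArrayType(DataTypes.StringType), true),
    DataTypes.createStructField("title", DataTypes.StringType, true)
));

// With an explicit schema there is no inference pass, so the map
// function runs only once, triggered by data.show().
Dataset<Row> data = spark.read().schema(schema).json(mappedRdd);
data.show();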

