Spark java Map function is getting executed twice


Problem Description

I have the following code in my Spark driver. When I execute my program it works properly, saving the required data as a parquet file.

      String indexFile = "index.txt";
      JavaRDD<String> indexData = sc.textFile(indexFile).cache();
      // Map each patient id to its records as a JSON array string
      JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
        @Override
        public String call(String patientId) throws Exception {
          return "json array as string";
        }
      });

      // 1. Read the JSON string array into a DataFrame (execution 1)
      DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
      // 2. Save the DataFrame as a parquet file (execution 2)
      dataSchemaDF.write().parquet("md.parquet");

But I observed that my mapper function on the RDD indexData is getting executed twice: first, when I read jsonStringRDD as a DataFrame using SQLContext, and second, when I write dataSchemaDF to the parquet file.

Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?

Answer

I believe that the reason is the lack of a schema for the JSON reader. When you execute:

sqlContext.read().json(jsonStringRDD);

Spark has to infer the schema for the newly created DataFrame. To do that it has to scan the input RDD, and this step is performed eagerly, before any action is called on the DataFrame.
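
One way to confirm the extra pass (this is an illustration, not part of the original answer) is to count invocations of the map function with an accumulator, reusing sc and indexData from the question. The sketch below assumes the Spark 1.x Java API used throughout this post:

import org.apache.spark.Accumulator;

// Counts how many times call() runs across the cluster.
final Accumulator<Integer> mapCalls = sc.accumulator(0);

JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
  @Override
  public String call(String patientId) throws Exception {
    mapCalls.add(1); // incremented on every invocation, on every pass over the RDD
    return "json array as string";
  }
});

// After both the json() read and the parquet write, mapCalls.value() on the
// driver will be roughly twice the number of input lines (one pass per execution).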

If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:

StructType schema;
...
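
For example, a minimal sketch of building such a schema (the field names and types below are hypothetical placeholders; they must match your actual JSON documents):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical fields; replace with the real structure of the JSON array elements.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("patientId", DataTypes.StringType, false),
    DataTypes.createStructField("measurement", DataTypes.DoubleType, true)
});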

and use it when you create the DataFrame:

DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
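
With an explicit schema, read().json() no longer needs an inference pass over the input, so the map function on indexData runs only once, when the parquet file is written.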
