Spark Java Map function is getting executed twice
Question
I have the following code in my Spark driver. When I execute my program, it works properly, saving the required data as a Parquet file.
String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        return "json array as string";
    }
});

// 1. Read the JSON string array into a DataFrame (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);

// 2. Save the DataFrame as a Parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");
But I observed that my mapper function on the RDD indexData is getting executed twice: first when I read jsonStringRDD as a DataFrame using SQLContext, and second when I write dataSchemaDF to the Parquet file.
Can you guide me on how to avoid this repeated execution? Is there a better way to convert a JSON string into a DataFrame?
Answer
I believe the reason is the lack of a schema for the JSON reader. When you execute:
sqlContext.read().json(jsonStringRDD);
Spark has to infer a schema for the newly created DataFrame. To do that, it has to scan the input RDD, and this step is performed eagerly.
If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:
StructType schema;
...
and use it when you create the DataFrame:
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
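As a concrete sketch, the schema can be built with Spark's Java DataTypes factory. The field names below are hypothetical, since the question never shows the actual JSON shape; replace them with the real fields of your documents:

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import java.util.Arrays;

// Hypothetical schema: assumes each JSON record has a "patientId" string
// and a "visitCount" long. Adjust the fields to match your documents.
StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("patientId", DataTypes.StringType, true),
    DataTypes.createStructField("visitCount", DataTypes.LongType, true)
));

// With an explicit schema, read().json() skips the eager inference pass
// over jsonStringRDD, so the mapper runs only once (for the Parquet write).
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
```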