Spark Java Map function is getting executed twice
Question
I have the following code in my Spark driver. When I execute my program, it works properly, saving the required data as a Parquet file.
String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
    @Override
    public String call(String patientId) throws Exception {
        return "json array as string";
    }
});

// 1. Read the JSON string array into a DataFrame (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);

// 2. Save the DataFrame as a Parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");
But I observed that my mapper function on the RDD indexData is getting executed twice: first when I read jsonStringRDD as a DataFrame using SQLContext, and second when I write dataSchemaDF to the Parquet file.
Can you guide me on how to avoid this repeated execution? Is there a better way to convert a JSON string into a DataFrame?
Answer
I believe the reason is the lack of a schema for the JSON reader. When you execute:
sqlContext.read().json(jsonStringRDD);
Spark has to infer a schema for the newly created DataFrame. To do that, it has to scan the input RDD, and this step is performed eagerly.
If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:
StructType schema;
...
and use it when you create the DataFrame:
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
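As a concrete sketch, the schema can be built with Spark's Java DataTypes factory. The field names below are hypothetical, since the question never shows the actual JSON shape; replace them with the real fields of your documents:

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import java.util.Arrays;

// Hypothetical schema: assumes each JSON record has a "patientId" string
// and a "visitCount" long. Adjust the fields to match your documents.
StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("patientId", DataTypes.StringType, true),
    DataTypes.createStructField("visitCount", DataTypes.LongType, true)
));

// With an explicit schema, read().json() skips the eager inference pass
// over jsonStringRDD, so the mapper runs only once (for the Parquet write).
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
```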