Change output filename prefix for DataFrame.write()

Views: 1399 Published: 2016/5/22 16:40:58 Tags: apache-spark apache-spark-sql

Question

Output files generated via the Spark SQL DataFrame.write() method begin with the "part" basename prefix. For example:

DataFrame sample_07 = hiveContext.table("sample_07");
sample_07.write().parquet("sample_07_parquet");

Result:

hdfs dfs -ls sample_07_parquet/
Found 4 items
-rw-r--r--   1 rob rob          0 2016-03-19 16:40 sample_07_parquet/_SUCCESS
-rw-r--r--   1 rob rob        491 2016-03-19 16:40 sample_07_parquet/_common_metadata
-rw-r--r--   1 rob rob       1025 2016-03-19 16:40 sample_07_parquet/_metadata
-rw-r--r--   1 rob rob      17194 2016-03-19 16:40 sample_07_parquet/part-r-00000-cefb2ac6-9f44-4ce4-93d9-8e7de3f2cb92.gz.parquet

I would like to change the output filename prefix used when writing files with Spark SQL DataFrame.write(). I tried setting the "mapreduce.output.basename" property on the Hadoop configuration of the Spark context. For example:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class MyJavaSparkSQL {

  public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("MyJavaSparkSQL");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    // Attempted override of the default "part" basename.
    ctx.hadoopConfiguration().set("mapreduce.output.basename", "myprefix");
    HiveContext hiveContext = new HiveContext(ctx.sc());
    DataFrame sample_07 = hiveContext.table("sample_07");
    sample_07.write().parquet("sample_07_parquet");
    ctx.stop();
  }
}

That did not change the output filename prefix of the generated files.

Is there a way to override the output filename prefix when using the DataFrame.write() method?

Recommended answer

You cannot change the "part" prefix while using any of the standard output formats (like Parquet). See this snippet from the ParquetRelation source code (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L386):

private val recordWriter: RecordWriter[Void, InternalRow] = {
  val outputFormat = {
    new ParquetOutputFormat[InternalRow]() {
      // ...
      override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
        // ...
        // The "part-r-" prefix is hard-coded here:
        new Path(path, f"part-r-$split%05d-$uniqueWriteJobId$bucketString$extension")
      }
    }
  }
  // ...
}
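
To see why no configuration property can reach the prefix, it may help to reproduce the format string from the snippet above in plain Java. This is only a sketch: the class PartFileNames and its method are hypothetical names made up for illustration and are not part of Spark. The write job ID is taken from the directory listing in the question.

```java
import java.util.Locale;

// Hypothetical helper (not a Spark class): mirrors the Scala format string
// f"part-r-$split%05d-$uniqueWriteJobId$bucketString$extension".
final class PartFileNames {
    private PartFileNames() {}

    static String partFileName(int split, String writeJobId, String bucketString, String extension) {
        // "part-r-" is a literal inside the format string, so nothing passed in can replace it.
        return String.format(Locale.ROOT, "part-r-%05d-%s%s%s", split, writeJobId, bucketString, extension);
    }

    public static void main(String[] args) {
        // Values taken from the listing in the question; the bucket string is empty for unbucketed output.
        System.out.println(partFileName(0, "cefb2ac6-9f44-4ce4-93d9-8e7de3f2cb92", "", ".gz.parquet"));
    }
}
```

With split 0 and the job ID from the question, this yields the same file name that appears in the hdfs listing, which suggests that only the split number, job ID, bucket string and extension vary between files.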

If you really must control the part file names, you will probably have to implement a custom FileOutputFormat and use one of Spark's save methods that accept a FileOutputFormat class, e.g. saveAsHadoopFile (https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaPairRDD.html#saveAsHadoopFile(java.lang.String,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class)).
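
At its core, a custom output format of the kind suggested above has to map the default leaf name to the desired one. The following is a minimal sketch of that mapping only, with the Hadoop and Spark wiring deliberately left out. The class PrefixRewriter and its method are hypothetical names, and the assumption that the default leaf name looks like "part-00000" should be checked against your Spark and Hadoop versions.

```java
// Hypothetical helper (not a Hadoop or Spark class): the name-rewriting step
// that a custom FileOutputFormat subclass might apply before opening its writer.
final class PrefixRewriter {
    private static final String DEFAULT_PREFIX = "part";

    private PrefixRewriter() {}

    static String withPrefix(String defaultName, String newPrefix) {
        // Rewrite only names that really start with the default basename ("part" or "part-...").
        if (defaultName.equals(DEFAULT_PREFIX) || defaultName.startsWith(DEFAULT_PREFIX + "-")) {
            return newPrefix + defaultName.substring(DEFAULT_PREFIX.length());
        }
        // Anything else (for example _SUCCESS) is returned unchanged.
        return defaultName;
    }

    public static void main(String[] args) {
        System.out.println(withPrefix("part-00000", "myprefix"));
    }
}
```

One place such a helper could be called from is an overridden naming hook, for example generateLeafFileName in a subclass of Hadoop's old-API MultipleTextOutputFormat, with that subclass then passed to saveAsHadoopFile. Note that this route writes through an RDD of key/value pairs rather than through DataFrame.write(), so the output would no longer be produced by the Parquet writer shown in the snippet.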
