Spark set S3 object metadata while writing to EMRFS
Question
I have Spark running on EMR, writing JSON files to S3 through EMRFS:
dataframe
.coalesce(1)
.write()
.option("compression", "gzip")
.mode(SaveMode.Overwrite)
.json(outputPath);
The problem is that the output file carries only the header Content-Type = application/octet-stream, and lacks Content-Encoding = gzip.
How can I set the metadata Content-Encoding = gzip on the output file while writing it with Spark?
You could also pass the writer options as a Map. In Scala:

import scala.collection.Map

val metadataOptions = Map("compression" -> "gzip", "Content-Language" -> "US-En")

dataframe.coalesce(1).write.mode(SaveMode.Overwrite).options(metadataOptions).json(outputPath)
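Note that Spark writer options are consumed by the JSON data source, and may not end up as headers on the S3 objects. A common workaround is to rewrite the object metadata after Spark finishes, by copying each output object onto itself with replaced metadata. A minimal sketch using the AWS CLI (the bucket and key below are hypothetical placeholders, not names from the question):

```shell
# Hypothetical bucket/key; adjust to the actual Spark output path.
# A self-copy with --metadata-directive REPLACE rewrites the stored
# Content-Type and Content-Encoding headers on the existing object.
aws s3 cp \
  s3://my-bucket/output/part-00000.json.gz \
  s3://my-bucket/output/part-00000.json.gz \
  --content-type application/json \
  --content-encoding gzip \
  --metadata-directive REPLACE
```

The same self-copy can be done programmatically, e.g. with CopyObjectRequest and withNewObjectMetadata in the AWS SDK for Java, if you want to script it over all part files.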