Spark set S3 object metadata while writing to EMRFS
Question
I have Spark running on EMR, writing JSON files to S3 through EMRFS:
dataframe
.coalesce(1)
.write()
.option("compression", "gzip")
.mode(SaveMode.Overwrite)
.json(outputPath);
The problem is that the output file carries only the header Content-Type = application/octet-stream, and lacks Content-Encoding = gzip.
How can I set the metadata Content-Encoding = gzip on the output file while writing it with Spark?
You could also pass the writer options as a Map. In Scala:

import scala.collection.Map

val metadataOptions = Map("compression" -> "gzip", "Content-Language" -> "US-En")

dataframe.coalesce(1).write.mode(SaveMode.Overwrite).options(metadataOptions).json(outputPath)
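Note that Spark writer options are consumed by the JSON data source, and may not end up as headers on the S3 objects. A common workaround is to rewrite the object metadata after Spark finishes, by copying each output object onto itself with replaced metadata. A minimal sketch using the AWS CLI (the bucket and key below are hypothetical placeholders, not names from the question):

```shell
# Hypothetical bucket/key; adjust to the actual Spark output path.
# A self-copy with --metadata-directive REPLACE rewrites the stored
# Content-Type and Content-Encoding headers on the existing object.
aws s3 cp \
  s3://my-bucket/output/part-00000.json.gz \
  s3://my-bucket/output/part-00000.json.gz \
  --content-type application/json \
  --content-encoding gzip \
  --metadata-directive REPLACE
```

The same self-copy can be done programmatically, e.g. with CopyObjectRequest and withNewObjectMetadata in the AWS SDK for Java, if you want to script it over all part files.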