How to avoid generating crc files and SUCCESS files while saving a DataFrame?
Question
I am using the following code to save a Spark DataFrame to a JSON file:
unzipJSON.write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")
The output is:
part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
_SUCCESS
._SUCCESS.crc
- How can I generate a single JSON file and not a file per line?
- How can I avoid the *crc files?
- How can I avoid the SUCCESS file?
Answer
If you want a single file, you need to do a coalesce to a single partition before calling write, so:
unzipJSON.coalesce(1).write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")
Personally, I find it rather annoying that the number of output files depends on the number of partitions you have before calling write - especially if you do a write with a partitionBy - but as far as I know, there is currently no other way.
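As a minimal sketch of that interaction (the partitioning column "date" is hypothetical, not taken from the question), coalescing before a partitioned write keeps each output directory down to a single part file:

unzipJSON
  .coalesce(1)                 // single in-memory partition => one part file per directory
  .write
  .mode("append")
  .partitionBy("date")         // hypothetical column: one output directory per distinct value
  .json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")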
I don't know of a way to disable the .crc files, but you can disable the _SUCCESS file by setting the following on the Hadoop configuration of the Spark context:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Note that you may also want to disable generation of the metadata files with:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
Apparently, generating the metadata files takes some time (see this blog post), but they aren't actually that important (according to this). Personally, I always disable them and I have had no issues.
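Putting the pieces together, a minimal end-to-end sketch (both configuration keys and the path come straight from this answer; sc is the SparkContext):

// Disable the _SUCCESS marker and Parquet summary metadata,
// then write a single JSON file by coalescing to one partition.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

unzipJSON
  .coalesce(1)
  .write
  .mode("append")
  .json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")
// The hidden .crc files will still be written, as noted above.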