Spark _temporary creation reason

Question

Why does Spark, when saving results to a file system, upload the result files to a _temporary directory and then move them to the output folder, instead of uploading them directly to the output folder?

Answer

A two-stage process is the simplest way to ensure consistency of the final result when working with file systems.
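
As a rough illustration (the path, app name and partition count below are made up for this example, not taken from the original answer): while a DataFrame write is running, the destination directory contains only a _temporary subtree, and the part files appear at the top level only once the job commits.

```python
# Minimal PySpark sketch; the path and names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("temporary-dir-demo").getOrCreate()

df = spark.range(0, 1_000_000).repartition(4)   # 4 tasks -> 4 part files

# While this call runs, listing /tmp/demo-output shows only a _temporary
# subtree; after every task and the job commit succeed, the part files are
# promoted to the top level of /tmp/demo-output.
df.write.mode("overwrite").parquet("/tmp/demo-output")
```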

You have to remember that each executor thread writes its result set independently of the other threads, and that writes can be performed at different moments in time, or even reuse the same set of resources. At the moment a write happens, Spark cannot determine whether all writes will succeed.

  • In case of failure, the changes can be rolled back simply by removing the temporary directory.
  • In case of success, the changes can be committed by moving the temporary directory into place (see the sketch below).
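
The same write-to-temp, commit-by-move, abort-by-delete pattern can be sketched outside Spark. The toy example below is only an illustration of the idea: the function name, paths and file layout are invented, it assumes a local POSIX file system where a rename within one volume is cheap, and it is not Spark's actual committer code.

```python
# Toy illustration of the two-stage pattern; all names here are hypothetical.
import os
import shutil
import uuid


def write_output(record_chunks, output_dir):
    tmp_dir = os.path.join(output_dir, "_temporary", uuid.uuid4().hex)
    os.makedirs(tmp_dir, exist_ok=True)
    try:
        # "Tasks": each writer produces its own part file under the temp dir,
        # independently of the others.
        for i, chunk in enumerate(record_chunks):
            with open(os.path.join(tmp_dir, f"part-{i:05d}.txt"), "w") as f:
                f.write("\n".join(str(x) for x in chunk))
        # "Job commit": promote every part file into the final output folder.
        for name in os.listdir(tmp_dir):
            os.replace(os.path.join(tmp_dir, name), os.path.join(output_dir, name))
    finally:
        # "Abort"/cleanup: deleting _temporary rolls back anything that was not
        # committed, so the output folder never holds partial results.
        shutil.rmtree(os.path.join(output_dir, "_temporary"), ignore_errors=True)


write_output([range(0, 3), range(3, 6)], "/tmp/toy-output")
```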

Another benefit of this model is the clear distinction between writes in progress and finalized output. As a result, it can easily be integrated with simple workflow management tools, without needing a separate state store or other synchronization mechanism.
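
For example, with the default Hadoop FileOutputCommitter settings, Spark leaves a _SUCCESS marker file in the output folder once the job commit finishes, so a downstream step can use that file alone to decide whether the directory is safe to read. A minimal sketch, with an assumed local path:

```python
# Minimal sketch: the _SUCCESS marker distinguishes finalized output from
# in-progress or failed writes (which only leave a _temporary subtree behind).
import os

output_dir = "/tmp/demo-output"   # illustrative path

if os.path.exists(os.path.join(output_dir, "_SUCCESS")):
    print("output is finalized; safe to read:", output_dir)
else:
    print("output is missing or still in progress; do not read:", output_dir)
```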

This model is simple and reliable, and it works well with the file systems it was designed for. Unfortunately, it doesn't perform as well with object stores, which don't support moves.
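
For completeness, here is a hedged sketch of the kind of configuration people typically turn to when the destination is an object store such as S3. None of this is stated in the original answer; the keys assume Spark running with Hadoop 3.x and the S3A connector, extra setup (such as the spark-hadoop-cloud bindings) may be required, and you should verify them against your own Spark/Hadoop documentation.

```python
# Hedged sketch: two commonly used alternatives for S3 output (assumed setup:
# Spark with Hadoop 3.x and the S3A connector). Not from the original answer.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("object-store-committer-demo")
    # Option 1: the classic committer's v2 algorithm, which moves task output
    # straight into the destination at task commit, reducing job-commit renames.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # Option 2 (preferred on S3): an S3A committer, which replaces the rename
    # step with S3 multipart uploads completed only at job commit.
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .getOrCreate()
)
```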
