Spark Accumulator Thread Safety and .value() Performance


Question

Accumulators are not thread-safe by design. However, the documentation says that when .value() is called, the values come back to the main driver thread. Does the driver program get the up-to-date values?

Moreover, is the .value() operation expensive, since it forces all the worker threads to respond and send their values back to the main driver program? If so, what are the alternatives?

I have a custom thread-safe Accumulator. However, I feel it might be overkill.

Recommended answer

Re: "Accumulators are not thread-safe by design."

I did not find this statement in the Spark documentation; I believe you might be referring to Java accumulators. The thread safety of the accumulator in your driver application depends on how you have implemented the driver program.
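One way to picture why task-side updates often need no locking: as far as I understand it, each Spark task works on its own copy of the accumulator and the copies are merged on the driver. The sketch below is a plain-Python toy model of that idea, not Spark code. `LocalAccumulator`, `run_task`, `run_job`, and the partition data are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class LocalAccumulator:
    """Hypothetical stand-in for a per-task accumulator copy (not a Spark class)."""

    def __init__(self):
        self.total = 0

    def add(self, amount):
        # No lock needed here: only the owning task touches this copy.
        self.total += amount

    def merge(self, other):
        self.total += other.total


def run_task(partition):
    """Simulate one task: update a private copy and hand it back."""
    local = LocalAccumulator()
    for record in partition:
        local.add(record)
    return local


def run_job(partitions):
    """Simulate the driver: merge the per-task copies one at a time."""
    driver_side = LocalAccumulator()
    with ThreadPoolExecutor(max_workers=4) as pool:
        for task_copy in pool.map(run_task, partitions):
            # Merging happens on the single calling thread.
            driver_side.merge(task_copy)
    return driver_side.total


partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
result = run_job(partitions)
print(result)
```

In this model, a lock only becomes necessary if your own driver code shares the merged accumulator across several threads, which is the case the answer is pointing at.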

One thing to note, however, is that accumulators may not be reliable. The unreliability comes from the fact that a failed Spark task can be retried, and in such cases the accumulators will not give you accurate values.
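To make the retry problem concrete, here is a small plain-Python model (again, not Spark code) of a driver that sums per-task updates. It assumes, purely for illustration, that the work of some tasks is executed twice and that both updates reach the driver. `merged_total`, `rerun_ids`, and `apply_once` are hypothetical names. For reference, the Spark programming guide states that updates made inside actions are applied only once per task, while updates made inside transformations may be applied more than once if tasks or stages are re-executed.

```python
def merged_total(partitions, rerun_ids, apply_once):
    """Toy driver loop: sum per-task updates, optionally ignoring repeats.

    rerun_ids: hypothetical set of task ids whose work is executed twice.
    apply_once: if True, keep only the first update seen for each task id.
    """
    total = 0
    seen = set()
    for task_id, partition in enumerate(partitions):
        runs = 2 if task_id in rerun_ids else 1
        for _ in range(runs):
            update = len(partition)  # every run counts its records again
            if apply_once and task_id in seen:
                continue  # drop the duplicate update from the second run
            seen.add(task_id)
            total += update
    return total


partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
naive = merged_total(partitions, rerun_ids={2}, apply_once=False)
guarded = merged_total(partitions, rerun_ids={2}, apply_once=True)
print(naive, guarded)
```

With six records in total, the naive total overcounts the re-run partition, while the guarded total does not. This is why an accumulator is better treated as a debugging or monitoring aid than as an exact result when it is updated inside a transformation.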

Re: "Moreover, is the .value() operation expensive, since it forces all the worker threads to respond and send their values back to the main driver program?"

I am not sure that is the case: whether or not you use accumulators, executors need to send heartbeat messages back to the driver. Moreover, if you are working with big data, the accumulator values are probably not a lot of data compared with an action such as collect. IMO, invoking .value() should not be a big performance concern. Additionally, in batch processing, the driver application would generally invoke .value() only once the executors have finished their tasks.
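The cost argument can be sketched with another plain-Python toy model, under the assumption (my understanding, not something stated in the question) that accumulator updates travel to the driver together with finished task results and that reading the value only returns what has already been merged locally. `DriverAccumulator`, `on_task_finished`, and `updates_received` are hypothetical names, not Spark API.

```python
class DriverAccumulator:
    """Hypothetical driver-side view of an accumulator (not a Spark class)."""

    def __init__(self):
        self._value = 0
        self.updates_received = 0  # how many task results carried an update

    def on_task_finished(self, partial):
        # Assumed to be the only point where data crosses the network.
        self._value += partial
        self.updates_received += 1

    @property
    def value(self):
        # A plain local read: no request is sent to any worker.
        return self._value


acc = DriverAccumulator()
for partial in [10, 20, 30]:  # three finished tasks report their partial sums
    acc.on_task_finished(partial)

reads = [acc.value for _ in range(1000)]  # read as often as you like
print(acc.value, acc.updates_received)
```

Under that assumption, the number of reads does not change how much data is transferred; what matters is reading after the job has finished, so that every task's update has already been merged.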
