When are accumulators truly reliable?


Problem description

I want to use an accumulator to gather some stats about the data I'm manipulating in a Spark job. Ideally, I would do that while the job computes the required transformations, but since Spark can re-compute tasks in some cases, the accumulators would not reflect true metrics. Here is how the documentation describes this:

For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.

This is confusing, since most actions do not allow running custom code (where accumulators can be used); they mostly take the results of previous transformations (lazily). The documentation also shows this:

val acc = sc.accumulator(0)
data.map { x => acc += x; f(x) }
// Here, acc is still 0 because no actions have caused the `map` to be computed.

But if we add data.count() at the end, would this be guaranteed to be correct (have no duplicates) or not? Clearly acc is not used "inside actions only", as map is a transformation. So it should not be guaranteed.
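
For concreteness, here is a minimal sketch of that scenario (assuming sc is an existing SparkContext, data is an RDD of integers, and f is some arbitrary function, as in the snippet above):

val acc = sc.accumulator(0)
// The accumulator is updated inside a transformation, not inside an action.
val mapped = data.map { x => acc += x; f(x) }
mapped.count() // the action that finally forces the map to run
// acc.value now holds a sum, but is it guaranteed to count every element exactly once?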

On the other hand, discussions on related Jira tickets talk about "result tasks" rather than "actions", for instance here and here. This seems to indicate that the result would indeed be guaranteed to be correct, since we are using acc immediately before an action and it should thus be computed in a single stage.

I'm guessing that this concept of a "result task" has to do with the type of operations involved, namely the last one that includes an action, as in this example, which shows how several operations are divided into stages (in magenta; image taken from http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/):

So hypothetically, a count() action at the end of that chain would be part of the same final stage, and would I then be guaranteed that accumulators used in the last map will not include any duplicates?
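
As a side note, one quick way to inspect how a lineage splits into stages is toDebugString (a small sketch; rdd stands for whatever the last RDD in that chain is): each indented block in its output corresponds to a stage, separated by shuffle boundaries.

// Indentation changes in the output mark shuffle boundaries, i.e. stage splits.
println(rdd.toDebugString)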

Clarification around this issue would be great! Thanks.

Answer

To answer the question "When are accumulators truly reliable?":

Answer: when the accumulator update is performed inside an action operation.

As per the documentation, for accumulator updates performed inside actions, each task's update is applied only once, even if the task is restarted.

For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.

And actions do allow running custom code. For example:

// Count non-empty records; the accumulator is updated inside an action (foreach),
// so each record contributes exactly once even if a task is restarted.
val accNotEmpty = sc.accumulator(0)
ip.foreach { x =>
  if (x != "") {
    accNotEmpty += 1
  }
}
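
As a small follow-up (a sketch, still assuming ip is an RDD of strings): once the action has completed, the aggregated count can be read on the driver through the accumulator's value.

// The value is only meaningful on the driver, after the action has finished.
println("Non-empty records: " + accNotEmpty.value)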

But why are Map + Action operations, viz. result tasks, not reliable for an accumulator?


  1. Task failure because of an exception in the code: Spark will retry the task 4 times (the default number of attempts). If the task fails every time, it gives an exception. If a retry succeeds, Spark continues and only counts the accumulator updates from the successful attempt; the accumulator values from the failed attempts are ignored. Verdict: handled properly.

  2. Stage failure: an executor node crashes, through no fault of the user but because of a hardware failure, and the node goes down during a shuffle stage. As shuffle output is stored locally, if a node goes down that shuffle output is gone. So Spark goes back to the stage which generated the shuffle output, looks at which tasks need to be re-run, and executes them on one of the nodes that is still alive. After the missing shuffle output is regenerated, the stage which generated the map output has executed some of its tasks multiple times, and Spark counts the accumulator updates from all of them. Verdict: not handled in a result task; the accumulator will give a wrong output.

  3. If a task is running slowly, Spark can launch a speculative copy of that task on another node (see the configuration sketch after this list). Verdict: not handled; the accumulator will give a wrong output.

  4. An RDD that is cached but too large to stay in memory: whenever that RDD is used, the map operation is re-run to recompute it, and the accumulator is updated by it again. Verdict: not handled; the accumulator will give a wrong output.
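
For item 3, speculative execution is what launches those duplicate task copies; a minimal configuration sketch (spark.speculation is the standard setting, the application name is made up):

import org.apache.spark.{SparkConf, SparkContext}

// With speculation enabled, a slow task may get a duplicate attempt on another node;
// a transformation-side accumulator can then be updated by both attempts.
val conf = new SparkConf()
  .setAppName("accumulator-demo") // hypothetical application name
  .set("spark.speculation", "true")
val sc = new SparkContext(conf)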

So the same function may run multiple times on the same data, and Spark therefore does not provide any guarantee for accumulators that are updated inside a map operation.
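
A minimal sketch of how such re-execution inflates a transformation-side accumulator (assuming lines is an uncached RDD of strings; the exact final value depends on what actually gets recomputed):

val acc = sc.accumulator(0)
val mapped = lines.map { x => acc += 1; x.length } // accumulator updated in a transformation, RDD not cached

mapped.count() // first action: acc grows by the number of lines
mapped.count() // second action: the map is recomputed, so acc grows again
// acc.value is now roughly twice the number of lines, not the true count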

So it is better to use accumulators inside action operations in Spark.

To know more about accumulators and their issues, refer to this blog post by Imran Rashid.

