The purpose of ClosureCleaner.clean

Question

Before sc.runJob invokes dagScheduler.runJob, the func to be run on the RDD is "cleaned" by ClosureCleaner.clean. Why does Spark have to do this? What is the purpose?

Answer

Ankur Dave, a fellow Spark Committer, wrote a good explanation of ClosureCleaner on Quora, reproduced below:

When Scala constructs a closure, it determines which outer variables the closure will use and stores references to them in the closure object. This allows the closure to work properly even when it's called from a different scope than it was created in.

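To make the capture mechanics concrete, here is a minimal sketch in plain Scala (the names `makeMultiplier` and `factor` are hypothetical, chosen only for illustration): the returned function keeps a reference to `factor` inside its closure object, which is why it still works after `makeMultiplier` has returned.

```scala
object CaptureExample {
  // Builds a function that closes over the local value `factor`.
  def makeMultiplier(): Int => Int = {
    val factor = 3            // outer variable in the enclosing scope
    (x: Int) => x * factor    // the closure stores a reference to `factor`
  }

  def main(args: Array[String]): Unit = {
    val multiply = makeMultiplier()
    // Invoked from a different scope than the one that created it;
    // `factor` is still reachable through the closure object.
    println(multiply(10))     // prints 30
  }
}
```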
Scala sometimes errs on the side of capturing too many outer variables (see SI-1419). That's harmless in most cases, because the extra captured variables simply don't get used (though this prevents them from getting GC'd). But it poses a problem for Spark, which has to send closures across the network so they can be run on slaves. When a closure contains unnecessary references, it wastes network bandwidth. More importantly, some of the references may point to non-serializable objects, and Spark will fail to serialize the closure.

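A hedged sketch of that failure mode (the class and field names below are hypothetical): when a closure uses a field of its enclosing instance, the reference it captures is really `this`, so the whole enclosing object must be serialized with the task; if that object is not serializable, the job fails with a "Task not serializable" error.

```scala
import org.apache.spark.SparkContext

// Hypothetical enclosing class; it is not Serializable because it
// holds a SparkContext, which cannot be serialized.
class BonusCalculator(sc: SparkContext) {
  val bonus = 10

  def addBonus(nums: Seq[Int]): Array[Int] =
    // `bonus` is a field, so `n + bonus` is really `n + this.bonus`.
    // The closure therefore captures `this` (the whole BonusCalculator,
    // including its SparkContext), and Spark cannot serialize the task.
    sc.parallelize(nums).map(n => n + bonus).collect()
}
```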
To work around this bug in Scala, the ClosureCleaner traverses the object at runtime and prunes the unnecessary references. Since it does this at runtime, it can be more accurate than the Scala compiler. Spark can then safely serialize the cleaned closure.

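Note that the cleaner can only prune references the closure does not actually need. When a field of a non-serializable object is genuinely used, a common idiom is to copy it into a local val first, so the closure captures just that value instead of `this`. A sketch, continuing the hypothetical `BonusCalculator` above:

```scala
  def addBonusSafely(nums: Seq[Int]): Array[Int] = {
    val localBonus = bonus    // copy the field into a local value
    // The closure now references only `localBonus` (a plain Int),
    // so nothing non-serializable ends up in the serialized task.
    sc.parallelize(nums).map(n => n + localBonus).collect()
  }
```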