Is Spark RDD cached on worker node or driver node (or both)?


Question

Can anyone please correct my understanding of persisting in Spark?

If we have performed a cache() on an RDD, its value is cached only on those nodes where the RDD was actually computed initially. Meaning, if there is a cluster of 100 nodes, and the RDD is computed in partitions of the first and second nodes, and we cached this RDD, then Spark is going to cache its value only in first or second worker nodes. So when this Spark application tries to use this RDD in later stages, the Spark driver has to get the value from the first/second nodes.

Am I right?

(Or)

Is it that the RDD value is persisted in driver memory and not on the nodes?

Answer

Change this:

then Spark is going to cache its value only in first or second worker nodes.

To this:

then Spark is going to cache its value only in first and second worker nodes.

And... yes, that's correct!
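As a small illustration (a hedged sketch; the `spark` session and the file path are assumptions made for the example, not taken from the original answer): cache() only marks the RDD, and the first action materializes its partitions in the memory of the executors (worker nodes) that compute them, not in the driver.

```scala
// Minimal sketch: cache() marks the RDD; the first action computes the
// partitions and stores them on the executors that did the computation.
val rdd = spark.sparkContext
  .textFile("hdfs:///some/path")   // placeholder path
  .map(_.toUpperCase)
  .cache()

rdd.count()                        // action: partitions are now cached on the workers
println(rdd.getStorageLevel)       // reports the storage level (MEMORY_ONLY is the default for cache())
// The "Storage" tab of the Spark UI shows which executors hold the cached partitions.
```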

Spark tries to minimize memory usage (and we love it for that!), so it won't create any unnecessary memory load: it evaluates every statement lazily, i.e. it won't do any actual work on a transformation; it will wait for an action to happen, which leaves Spark no choice but to do the actual work (read the file, send the data across the network, do the computation, collect the result back to the driver, for example...).
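For instance (a minimal sketch, again assuming an existing `spark` session and an illustrative file name):

```scala
// Transformations only build the lineage; nothing is read or computed yet.
val lines   = spark.sparkContext.textFile("data.txt")  // transformation: lazy
val lengths = lines.map(_.length)                      // transformation: still lazy

// The action below is what finally forces Spark to read the file and compute.
val totalChars = lengths.reduce(_ + _)
```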

You see, we don't want to cache everything unless we really can, that is, the memory capacity allows for it (yes, we can ask for more memory in the executors and/or the driver, but sometimes our cluster just doesn't have the resources, which is really common when we handle big data), and it really makes sense, i.e. the cached RDD is going to be used again and again (so caching it will speed up the execution of our job).
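A short sketch of that idea (the `events.log` path, the filters, and the storage level are illustrative assumptions): persist an RDD only because several actions reuse it, and spill to disk when memory is tight.

```scala
import org.apache.spark.storage.StorageLevel

// Persist because the RDD is reused by more than one action;
// MEMORY_AND_DISK spills partitions to disk when memory runs out.
val events = spark.sparkContext.textFile("events.log")
val errors = events.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_AND_DISK)

val errorCount  = errors.count()                              // first action: computes and caches
val loginErrors = errors.filter(_.contains("login")).count()  // reuses the cached partitions
```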

That's why you want to unpersist() your RDD when you no longer need it...! :)
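Continuing the hypothetical `errors` RDD from the sketch above, that would simply be:

```scala
// Release the cached partitions once the RDD is no longer needed.
errors.unpersist()
```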

Check this image, from one of my jobs, where I had requested 100 executors; the Executors tab, however, displayed 101, i.e. 100 slaves/workers and one master/driver:
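For reference, requesting executors and memory for such a job could look roughly like the sketch below (the values are illustrative, not the ones from the original job, and these settings are usually passed via spark-submit or spark-defaults rather than hard-coded); the driver then shows up as the extra entry in the Executors tab.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative resource request; actual values depend on the cluster.
val spark = SparkSession.builder()
  .appName("cache-example")
  .config("spark.executor.instances", "100")  // number of executors (workers)
  .config("spark.executor.memory", "4g")      // memory per executor
  .config("spark.driver.memory", "2g")        // normally set at launch time; shown only for illustration
  .getOrCreate()
```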

