Is Spark RDD cached on worker node or driver node (or both)?


Question

Can anyone please correct my understanding of persisting by Spark?

If we have performed a cache() on an RDD, its value is cached only on those nodes where the RDD was actually computed initially. Meaning, if there is a cluster of 100 nodes, and the RDD is computed in partitions of the first and second nodes, then if we cached this RDD, Spark is going to cache its value only in first or second worker nodes. So when this Spark application tries to use this RDD in later stages, the Spark driver has to get the value from the first/second nodes.

Am I correct?

(OR)

Is it that the RDD value is persisted in driver memory and not on the nodes?

Answer

Change this:

then Spark is going to cache its value only in first or second worker nodes.

to this:

then Spark is going to cache its value only in first and second worker nodes.

And with that, you are correct!

Spark tries to minimize memory usage (and we love it for that!), so it won't create any unnecessary memory load: it evaluates every statement lazily, i.e. it won't do any actual work on a transformation. It will wait for an action to happen, which leaves Spark no choice but to do the actual work (read the file, send the data over the network, do the computation, collect the result back to the driver, for example).

You see, we don't want to cache everything, unless we really can (that is, the memory capacity allows for it; yes, we can ask for more memory in the executors and/or the driver, but sometimes our cluster just doesn't have the resources, which is really common when we handle big data) and it really makes sense, i.e. the cached RDD is going to be used again and again, so caching it will speed up the execution of our job.

That's why you want to unpersist() your RDD when you no longer need it! :)

Check this image; it is from one of my jobs, where I had requested 100 executors. However, the Executors tab displayed 101, i.e. 100 slaves/workers and one master/driver:
