Spark cache vs broadcast

Views: 131 Published: 2020/9/4 5:31:34 Tags: caching, apache-spark


Question

It looks like the broadcast method makes a distributed copy of an RDD in my cluster. On the other hand, calling the cache() method simply loads the data into memory.

But I do not understand how a cached RDD is distributed across the cluster.

Could you please tell me in which cases I should use the rdd.cache() and rdd.broadcast() methods?

Recommended answer

"Could you please tell me in which cases I should use the rdd.cache() and rdd.broadcast() methods?"

RDDs are divided into partitions. Each partition is an immutable subset of the entire RDD. When Spark executes each stage of the graph, each partition is sent to a worker, which operates on that subset of the data. In turn, each worker can cache the partitions it holds if the RDD needs to be iterated over again, so a cached RDD stays spread across the workers rather than being copied in full to each of them.
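To make the partition-level caching concrete, here is a toy model in plain Python. It is not Spark code: ToyWorker, split and run_job are hypothetical names invented for this sketch, and the round-robin assignment of partitions to workers is an assumption made to keep it short. It only illustrates that each worker ends up caching the partitions it processed, and that a second pass over the data reuses those cached partitions instead of recomputing them.

```python
# Toy model in plain Python (NOT Spark): partitions are computed on workers,
# and each worker caches only the partitions it handled itself.


class ToyWorker:
    """Hypothetical stand-in for a worker with a local partition cache."""

    def __init__(self, name):
        self.name = name
        self.cache = {}         # partition index -> computed rows
        self.computations = 0   # how many partitions were computed here

    def run(self, index, partition, transform, use_cache):
        if use_cache and index in self.cache:
            return self.cache[index]    # reuse the cached partition
        self.computations += 1
        rows = [transform(x) for x in partition]
        if use_cache:
            self.cache[index] = rows
        return rows


def split(data, num_partitions):
    """Split data into immutable chunks (tuples), one per partition."""
    return [tuple(data[i::num_partitions]) for i in range(num_partitions)]


def run_job(workers, partitions, transform, use_cache):
    """Assumed scheduling: partition i goes to worker i % len(workers)."""
    out = []
    for i, part in enumerate(partitions):
        worker = workers[i % len(workers)]
        out.extend(worker.run(i, part, transform, use_cache))
    return out


workers = [ToyWorker("w0"), ToyWorker("w1")]
partitions = split(list(range(8)), 4)

# Two passes over the same data; the second pass hits the worker caches.
first = run_job(workers, partitions, lambda x: x * x, use_cache=True)
second = run_job(workers, partitions, lambda x: x * x, use_cache=True)

cached_on = {w.name: sorted(w.cache) for w in workers}
total_computations = sum(w.computations for w in workers)
print(cached_on)
print(total_computations)
```

No worker in this model holds the whole dataset: w0 keeps partitions 0 and 2, w1 keeps partitions 1 and 3, and the four partitions are computed once in total across both passes.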

Broadcast variables are used to send some immutable state once to each worker. You use them when you want a local copy of a variable on every worker.
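The "sent once to each worker" behaviour can be sketched with another toy model in plain Python. Again, this is not Spark code: ToyWorker, receive, run_task and the country lookup table are all hypothetical, chosen only for illustration. The sketch assumes a small read-only lookup table that every task needs, and shows that it is transferred once per worker no matter how many tasks run there.

```python
# Toy model in plain Python (NOT Spark): a read-only value is shipped once per
# worker, and every task on that worker reads the local copy.
from types import MappingProxyType


class ToyWorker:
    """Hypothetical stand-in for a worker holding local broadcast copies."""

    def __init__(self, name):
        self.name = name
        self.broadcasts = {}   # broadcast id -> local read-only copy
        self.transfers = 0     # how many times a value was shipped here

    def receive(self, broadcast_id, value):
        # Ship the value only if this worker does not already hold it.
        if broadcast_id not in self.broadcasts:
            self.transfers += 1
            self.broadcasts[broadcast_id] = MappingProxyType(dict(value))

    def run_task(self, partition, broadcast_id):
        lookup = self.broadcasts[broadcast_id]   # local copy, no transfer
        return [lookup.get(code, "unknown") for code in partition]


country_names = {"DE": "Germany", "FR": "France", "JP": "Japan"}
workers = [ToyWorker("w0"), ToyWorker("w1")]
partitions = [("DE", "FR"), ("JP", "XX"), ("FR", "FR"), ("DE", "JP")]

results = []
for i, part in enumerate(partitions):
    worker = workers[i % len(workers)]       # assumed round-robin scheduling
    worker.receive("countries", country_names)
    results.append(worker.run_task(part, "countries"))

transfers = {w.name: w.transfers for w in workers}
print(results)
print(transfers)
```

Four tasks run, but each of the two workers receives the table only once. In Spark itself, a broadcast variable is created from the SparkContext with sc.broadcast(value) and read inside tasks through its value attribute; it is not a method on an RDD.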

These two operations are quite different from each other, and each one represents a solution to a different problem.
