(Why) do we need to call cache or persist on an RDD?


Question

When a resilient distributed dataset (RDD) is created from a text file or a collection (or from another RDD), do we need to call cache or persist explicitly to store the RDD data in memory? Or is the RDD data stored in memory, in a distributed way, by default?

val textFile = sc.textFile("/user/emp.txt")

As I understand it, after the step above, textFile is an RDD and is available in the memory of all or some of the nodes.

If so, why do we need to call cache or persist on the textFile RDD?

Recommended answer

Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:

val textFile = sc.textFile("/user/emp.txt")

It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.

RDD operations that need to observe the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count: to tell you the number of lines in the file, the file has to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.

What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
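
One way to observe this recomputation is to count how often a transformation actually runs. The sketch below is not part of the original answer. It assumes a local Spark installation, uses a small in-memory collection as a stand-in for /user/emp.txt, and the names LazyRddDemo, evaluationCounts and evaluations are made up for illustration. Accumulator updates made inside a transformation can be applied more than once if tasks are retried, so the counter is a diagnostic aid rather than an exact metric.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddDemo {
  // Returns how many times the map function has run: before any action,
  // after the first count, and after the second count.
  def evaluationCounts(sc: SparkContext): (Long, Long, Long) = {
    val evaluations = sc.longAccumulator("evaluations")
    // Stand-in for sc.textFile(...): three "lines" from a local collection.
    val lines = sc.parallelize(Seq("alice", "bob", "carol"), numSlices = 1)
      .map { line => evaluations.add(1L); line }

    val beforeAction = evaluations.sum // no action yet, so nothing has run
    lines.count()                      // first action: every line is processed
    val afterFirst = evaluations.sum
    lines.count()                      // second action: every line is processed again
    val afterSecond = evaluations.sum
    (beforeAction, afterFirst, afterSecond)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-rdd-demo").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try println(evaluationCounts(sc))
    finally sc.stop()
  }
}
```

In a failure-free local run this should print (0,3,6): nothing happens until the first count, and the second count repeats the work instead of reusing it.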

So what does RDD.cache do? If you add textFile.cache to the above code:

val textFile = sc.textFile("/user/emp.txt")
textFile.cache

It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
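
The effect of cache can be checked with a counter on a transformation. This is a minimal sketch under stated assumptions, not the answer author's code: it runs Spark in local mode, replaces the text file with a tiny in-memory collection that easily fits in memory, and CachedRddDemo, evaluationCounts and evaluations are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedRddDemo {
  // Returns how many times the map function has run: right after cache()
  // is called, after the first count, and after the second count.
  def evaluationCounts(sc: SparkContext): (Long, Long, Long) = {
    val evaluations = sc.longAccumulator("evaluations")
    val lines = sc.parallelize(Seq("alice", "bob", "carol"), numSlices = 1)
      .map { line => evaluations.add(1L); line }
      .cache() // only marks the RDD for caching; no job is started here

    val afterCacheCall = evaluations.sum
    lines.count()                    // first action: computes and caches the lines
    val afterFirst = evaluations.sum
    lines.count()                    // second action: served from the cache
    val afterSecond = evaluations.sum
    (afterCacheCall, afterFirst, afterSecond)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cached-rdd-demo").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try println(evaluationCounts(sc))
    finally sc.stop()
  }
}
```

With enough memory for the three cached lines, a failure-free local run should print (0,3,3): the map function does not run again for the second count.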

The cache behavior depends on the available memory. If the file does not fit in memory, for example, then textFile.count will fall back to the usual behavior and re-read the file. (With the default storage level, partitions that do not fit are not cached and are recomputed when they are needed.)
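
When memory is the concern, persist with an explicit storage level is one option: for RDDs, cache is shorthand for persist(StorageLevel.MEMORY_ONLY), while StorageLevel.MEMORY_AND_DISK writes partitions that do not fit in memory to disk instead of dropping them. The sketch below goes beyond what the original answer covers. It assumes local mode and a small in-memory collection in place of the file, and PersistLevelDemo and storageLevels are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistLevelDemo {
  // Returns the RDD's storage level before persist, after persist,
  // and after unpersist.
  def storageLevels(sc: SparkContext): (StorageLevel, StorageLevel, StorageLevel) = {
    val lines = sc.parallelize(Seq("alice", "bob", "carol"))
    val initial = lines.getStorageLevel         // StorageLevel.NONE: nothing is kept

    lines.persist(StorageLevel.MEMORY_AND_DISK) // lazy, like cache: only marks the RDD
    val persisted = lines.getStorageLevel
    lines.count()                               // first action materializes the data
    lines.count()                               // later actions reuse the stored partitions

    lines.unpersist()                           // release the storage when finished
    val released = lines.getStorageLevel
    (initial, persisted, released)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-level-demo").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try println(storageLevels(sc))
    finally sc.stop()
  }
}
```

Reading from disk is slower than reading from memory, so whether MEMORY_AND_DISK beats recomputation depends on how expensive it is to rebuild the RDD from its source.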
