(Why) do we need to call cache or persist on an RDD?


Question

When a resilient distributed dataset (RDD) is created from a text file or a collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data in memory? Or is the RDD data stored in memory, distributed across the nodes, by default?

val textFile = sc.textFile("/user/emp.txt")

As I understand it, after the above step textFile is an RDD and is available in the memory of all or some of the nodes.

If so, why do we then need to call "cache" or "persist" on the textFile RDD?

Recommended answer

Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:

val textFile = sc.textFile("/user/emp.txt")

It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.

RDD operations that need to observe the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count: to tell you the number of lines in the file, the file has to be read. So if you write textFile.count, at that point the file will be read, the lines will be counted, and the count will be returned.

What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
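The recomputation described above can be observed with a minimal sketch like the following. It assumes spark-core is on the classpath and runs Spark in local mode; the object name LazyRddSketch, the three in-memory sample lines (standing in for /user/emp.txt so the sketch runs anywhere), and the linesRead accumulator are illustrative names, not part of the original answer. The accumulator is incremented every time a line passes through map, so it shows how often the source is traversed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddSketch {
  // Returns the accumulator value before any action, after the first
  // count, and after the second count.
  def run(): Seq[Long] = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-rdd-sketch")
    val sc = new SparkContext(conf)
    try {
      val linesRead = sc.longAccumulator("linesRead")

      // Only a description of the work; nothing is executed here.
      val lines = sc.parallelize(Seq("alice", "bob", "carol"), 2)
        .map { line => linesRead.add(1L); line }

      val beforeAction = linesRead.sum // no action has run yet
      lines.count()                    // action: every line is traversed
      val afterFirst = linesRead.sum
      lines.count()                    // action again: traversed a second time
      val afterSecond = linesRead.sum

      Seq(beforeAction, afterFirst, afterSecond)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(run()) // in local mode this should show 0, then 3, then 6
}
```

The counter stays at zero until the first count and doubles after the second one, which matches the statement that nothing is stored between actions.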

So what does RDD.cache do? If you add textFile.cache to the above code:

val textFile = sc.textFile("/user/emp.txt")
textFile.cache

It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count for the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
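A small sketch of the cached case, under the same assumptions as before: spark-core on the classpath, local mode, and illustrative names (CachedRddSketch, linesRead) plus in-memory sample lines in place of the real file. The accumulator counts how many lines pass through map, so a second count that is served from the cache should leave it unchanged.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachedRddSketch {
  // Returns the accumulator value before any action, after the first
  // count, and after the second count, this time with cache() applied.
  def run(): Seq[Long] = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("cached-rdd-sketch")
    val sc = new SparkContext(conf)
    try {
      val linesRead = sc.longAccumulator("linesRead")

      // cache() only marks the RDD for caching; it is lazy as well.
      val lines = sc.parallelize(Seq("alice", "bob", "carol"), 2)
        .map { line => linesRead.add(1L); line }
        .cache()

      val beforeAction = linesRead.sum // still nothing computed
      lines.count()                    // computes the partitions and caches them
      val afterFirst = linesRead.sum
      lines.count()                    // served from the cache, map is not re-run
      val afterSecond = linesRead.sum

      Seq(beforeAction, afterFirst, afterSecond)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(run()) // with this tiny data set this should show 0, then 3, then 3
}
```

The data here is small enough to fit in memory comfortably; with a large input the second count may still recompute whatever could not be cached.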

The cache behavior depends on the available memory. If the file does not fit in memory, for example, then textFile.count will fall back to the usual behavior and re-read the file, at least for the partitions that could not be cached.
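The fallback described here applies to the default storage level. For RDDs, cache is shorthand for persist with StorageLevel.MEMORY_ONLY, while persist also accepts other levels such as StorageLevel.MEMORY_AND_DISK, which keeps partitions that do not fit in memory on disk instead of recomputing them. The sketch below is only an illustration under assumptions: local mode, spark-core on the classpath, and a made-up object name (PersistLevelSketch) and sample data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistLevelSketch {
  // Returns the storage levels recorded for an RDD marked with cache()
  // and for one marked with persist(MEMORY_AND_DISK).
  def run(): (StorageLevel, StorageLevel) = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-level-sketch")
    val sc = new SparkContext(conf)
    try {
      // Memory only: partitions that do not fit are recomputed when needed.
      val memoryOnly = sc.parallelize(1 to 1000).map(_ * 2).cache()

      // Memory first, then disk for partitions that do not fit.
      val memoryAndDisk = sc.parallelize(1 to 1000).map(_ * 2)
        .persist(StorageLevel.MEMORY_AND_DISK)

      // Both calls are lazy; the actions below materialize the data.
      memoryOnly.count()
      memoryAndDisk.count()

      val levels = (memoryOnly.getStorageLevel, memoryAndDisk.getStorageLevel)

      // Release the stored partitions once they are no longer needed.
      memoryOnly.unpersist()
      memoryAndDisk.unpersist()
      levels
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Which level is appropriate depends on how expensive the RDD is to recompute compared with reading it back from disk.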
