In Spark Streaming, must I call count() after cache() or persist() to force caching/persistence to really happen?


Question

In this very good video on Spark internals, the presenter says that unless one performs an action on one's RDD after caching it, caching will not really happen.

I never see count() being called in any other circumstances. So I'm guessing that he is only calling count() after cache() to force persistence in the simple example he is giving. It is not necessary to do this every time one calls cache() or persist() in one's code. Is this right?

Answer

"unless one performs an action on one's RDD after caching it, caching will not really happen."

This is 100% true. The methods cache/persist only mark the RDD for caching; the items inside the RDD are actually cached the first time an action is called on the RDD.
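To see the laziness yourself, here is a minimal sketch (the SparkContext setup and the docs.txt path are illustrative assumptions, not from the original answer). For RDDs, cache() is simply shorthand for persist(StorageLevel.MEMORY_ONLY), and neither triggers any work by itself:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("LazyCacheDemo").setMaster("local[*]"))

val rdd = sc.textFile("docs.txt")       // hypothetical input file
rdd.persist(StorageLevel.MEMORY_ONLY)   // same as cache(): only records the storage level
// Nothing has been computed or stored yet; the Spark UI's Storage tab is still empty.

rdd.count()   // first action: computes the RDD and materializes it in the cache
rdd.count()   // second action: served from the cached partitions, no recomputation

The first count() pays for both computing the RDD and filling the cache; every later action on rdd reads the cached blocks. Until that first action runs, nothing is cached.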

"...only calling count() after cache() to force persistence in the simple example he is giving. It is not necessary to do this every time one calls cache() or persist() in one's code. Is this right?"

You are 100% right again. But I'll elaborate on this a bit.

For easy understanding, consider the example below.

rdd.cache()                                  // only marks the RDD for caching
val transformed = rdd.map(...).flatMap(...)  // and so on: transformations stay lazy
transformed.count()                          // or any other action: triggers the whole pipeline

Assume you have 10 documents in your RDD. When the above snippet is run, each document goes through these tasks:

  • cached
  • map function
  • flatMap function

On the other hand,

rdd.cache().count()                          // an action right away: caches the whole RDD first
val transformed = rdd.map(...).flatMap(...)  // and so on
transformed.count()                          // or any other action: now reads from the cache

When the above snippet is run, all 10 documents are cached first (the whole RDD); only then are the map function and the flatMap function applied.

Both are right and are used as per the requirements. Hope this makes things clearer.
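Since the question is specifically about Spark Streaming, the same rule applies per batch: calling persist() on a DStream only marks the RDDs it generates for caching, and the first output operation of each batch is what materializes them. A minimal sketch, assuming a socket text source on localhost:9999 (the source, host, port, and batch interval are illustrative assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingCacheDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
words.persist()   // marks each batch's RDD for caching; still lazy

// Two output operations run as two jobs per batch: the first job materializes
// the cached batch, the second reuses it instead of re-reading and re-splitting.
words.count().print()
words.filter(_.nonEmpty).count().print()

ssc.start()
ssc.awaitTermination()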
