What does "RDDs can be stored in memory" mean in Spark?


Problem Description

The introduction to Spark says:

RDDs can be stored in memory between queries without requiring replication.

As far as I know, you must cache an RDD manually by using .cache() or .persist(). If I take neither measure, like below:

   val file = sc.textFile("hdfs://data/kv1.txt")
   val words = file.flatMap(line => line.split(" "))
   words.count()

If I don't persist the RDD in cache or on disk, can Spark still run faster than MapReduce in this case?

Recommended Answer

What will happen is that Spark will compute each stage of the computation, partition by partition. It will hold some data in memory temporarily to do its work. It may have to spill data to disk and transfer it across the network to execute some stages. But none of this is (necessarily) persistent. If you count() again, it will start from scratch.
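To make the recomputation concrete, here is a minimal sketch. It assumes a running SparkContext named sc and reuses the hypothetical HDFS path from the question; it is not runnable without a Spark deployment:

```scala
val file = sc.textFile("hdfs://data/kv1.txt")
val words = file.flatMap(line => line.split(" "))

// Each action re-executes the full lineage: read from HDFS, split, count.
words.count() // reads and computes every partition
words.count() // nothing was persisted, so this reads and computes everything again
```

Transformations like flatMap are lazy; only the action count() triggers work, and without persist() there is no saved intermediate state for the second action to reuse.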

This is not a case where Spark would run faster than MapReduce; it would probably be slower for a simple operation like this. In fact, nothing here would benefit from loading into memory.

More complex examples, like a non-trivial pipeline or repeated access to the RDD, would show the benefit of persisting in memory, or even on disk.
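For contrast, a minimal sketch of the repeated-access case, under the same assumptions (a SparkContext sc and the question's hypothetical input path):

```scala
val words = sc.textFile("hdfs://data/kv1.txt")
  .flatMap(line => line.split(" "))
  .cache() // mark the RDD for in-memory storage (default MEMORY_ONLY level)

words.count() // first action: computes the lineage and caches the partitions
words.count() // later actions reuse the cached partitions instead of re-reading HDFS
```

cache() is shorthand for persist() with the in-memory storage level; the saving only appears from the second action onward, which is why a single-pass job like the one in the question gains nothing.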

