Spark RDD - are partitions always in RAM?


Problem description



We all know Spark does its computation in memory. I am just curious about the following.

  1. If I create 10 RDDs in my pySpark shell from HDFS, does it mean all these 10 RDDs' data will reside in the Spark workers' memory?

  2. If I do not delete an RDD, will it stay in memory forever?

  3. If my dataset (file) size exceeds the available RAM, where will the data be stored?

Solution

If I create 10 RDDs in my pySpark shell from HDFS, does it mean all these 10 RDDs' data will reside in Spark memory?

Yes, the data of all 10 RDDs will be spread across the Spark worker machines' RAM, but not every machine necessarily holds a partition of each RDD. Of course, an RDD will have data in memory only once an action has been performed on it, since RDDs are lazily evaluated.
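A quick illustration in the pySpark shell (the HDFS path is a hypothetical placeholder):

    # Nothing is read into worker memory yet: textFile() only records
    # the RDD's lineage, because RDDs are lazily evaluated.
    rdd = sc.textFile("hdfs:///data/example.txt")

    # Only when an action runs do the workers actually load and
    # process the partitions.
    print(rdd.count())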

If I do not delete an RDD, will it stay in memory forever?

Spark automatically unpersists an RDD or DataFrame once it is no longer used. To find out whether an RDD or DataFrame is cached, open the Spark UI -> Storage tab and check the memory details. You can use df.unpersist() or sqlContext.uncacheTable("sparktable") to remove the DataFrame or table from memory.
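For example, assuming an existing DataFrame df and the older sqlContext API the answer refers to (with "sparktable" as a hypothetical, previously cached table name):

    df.cache()           # mark df for caching; materialized by the next action
    df.count()           # action: blocks now appear under Spark UI -> Storage
    print(df.is_cached)  # True while df is cached

    df.unpersist()                         # drop df's blocks from memory
    sqlContext.uncacheTable("sparktable")  # drop a cached table's blocks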

If my dataset size exceeds the available RAM, where will the data be stored?

If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed.
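If you would rather spill those partitions to disk than recompute them, you can persist the RDD with a storage level that allows disk. A minimal sketch (the path is hypothetical):

    from pyspark import StorageLevel

    rdd = sc.textFile("hdfs:///data/big_file.txt")

    # MEMORY_ONLY (the default) drops partitions that do not fit and
    # recomputes them on demand; MEMORY_AND_DISK spills them to local
    # disk instead.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()  # action that materializes the persisted partitions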

If we are saying the RDD is already in RAM, meaning it is in memory, what is the need for persist()? (as per a comment)

To answer your question: when an action is triggered on an RDD and that action cannot find enough memory, Spark can evict the blocks of uncached/unpersisted RDDs to make room.

In general, we persist RDDs that require a lot of computation and/or shuffling (by default, Spark persists shuffled RDDs to avoid costly network I/O). That way, when an action is performed on a persisted RDD, only that action is executed rather than recomputing everything from the start of the lineage graph. See the available RDD persistence levels for more detail.
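A minimal sketch of why this matters, using a hypothetical pipeline with an expensive shuffle: without persist(), each action would recompute the whole lineage (read, map, shuffle); with it, the second action is served from the materialized partitions.

    # Word-count style pipeline; reduceByKey triggers a shuffle.
    pairs = (sc.textFile("hdfs:///data/events.txt")
               .map(lambda line: (line.split(",")[0], 1))
               .reduceByKey(lambda a, b: a + b))

    pairs.persist()   # keep the shuffled result (MEMORY_ONLY by default)

    print(pairs.count())  # first action: computes and caches the partitions
    print(pairs.take(5))  # second action: reuses the cached partitions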

