Spark RDD - are partitions always in RAM?
We all know Spark does its computation in memory. I am just curious about the following:

- If I create 10 RDDs in my pySpark shell from HDFS, does it mean all these 10 RDDs' data will reside in the Spark workers' memory?
- If I do not delete an RDD, will it be in memory forever?
- If my dataset (file) size exceeds the available RAM size, where will the data be stored?
Solution

If I create 10 RDDs in my pySpark shell from HDFS, does it mean all these 10 RDDs' data will reside in Spark memory?
Yes, the data of all 10 RDDs will be spread across the Spark worker machines' RAM, but not every machine necessarily holds a partition of each RDD. Of course, an RDD will have data in memory only once an action is performed on it, since RDDs are lazily evaluated.
If I do not delete an RDD, will it be in memory forever?
Spark automatically unpersists an RDD or DataFrame when it is no longer used. To find out whether an RDD or DataFrame is cached, you can open the Spark UI -> Storage tab and check the memory details. You can use df.unpersist() or sqlContext.uncacheTable("sparktable") to remove the df or table from memory. link to read more

If my dataset size exceeds the available RAM size, where will the data be stored?
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they are needed. link to read more
If we are saying an RDD is already in RAM, meaning it is in memory, what is the need for persist()? -- As per comment
To answer your question: when an action is triggered on an RDD and that action cannot find enough memory, Spark can evict uncached/unpersisted RDDs to free it.
In general, we persist an RDD that needs a lot of computation and/or shuffling (by default Spark persists shuffled RDDs to avoid costly network I/O), so that when any action is performed on the persisted RDD, it simply executes that action rather than recomputing the RDD from the start of its lineage graph. Check RDD persistence levels here.