Does spark automatically cache some results?


Question

I run an action two times, and the second time takes very little time to run, so I suspect that Spark automatically caches some results. But I couldn't find any source confirming this.

I'm using Spark 1.4.

doc = sc.textFile('...')
doc_wc = doc.flatMap(lambda x: re.split(r'\W', x)) \
            .filter(lambda x: x != '') \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda x, y: x + y)
%%time
doc_wc.take(5) # first time
# CPU times: user 10.7 ms, sys: 425 µs, total: 11.1 ms
# Wall time: 4.39 s

%%time
doc_wc.take(5) # second time
# CPU times: user 6.13 ms, sys: 276 µs, total: 6.41 ms
# Wall time: 151 ms
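For comparison, the pipeline above is an ordinary word count. The same computation in plain Python, using hypothetical sample lines in place of the question's text file, looks like this:

```python
import re
from collections import Counter

# Hypothetical sample input standing in for the file read by sc.textFile.
lines = ["spark caches shuffle output", "spark reuses shuffle output"]

# Same steps as the RDD pipeline: split on non-word characters, drop
# empty strings, then count occurrences (Counter plays the role of
# map + reduceByKey).
words = [w for line in lines for w in re.split(r'\W', line) if w != '']
doc_wc = Counter(words)

print(doc_wc.most_common(3))
```

The difference is that Spark evaluates this lazily and distributes it across partitions, which is why the question of what gets cached between two actions arises at all.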

Answer

From the Spark documentation on RDD persistence:

Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.

The underlying filesystem will also be caching access to the disk.
