Spark: Why do I have to explicitly tell what to cache?


Question


In Spark, each time we perform an action on an RDD, the RDD is recomputed. So if we know that an RDD is going to be reused, we should cache it explicitly.
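To make the recomputation concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: `LazyDataset`, its `evaluations` counter, and the helper names are all hypothetical, chosen only to mimic how a lineage is re-run on every action unless the result is explicitly kept. In Spark itself the corresponding calls are `rdd.cache()` or `rdd.persist(...)`.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it stores a recipe (lineage), not data."""

    def __init__(self, compute):
        self._compute = compute          # zero-argument function returning a list
        self._cache_requested = False    # set by cache(), honoured on first action
        self._cached = None              # materialized data, once cached
        self.evaluations = 0             # how many times the lineage was run

    def map(self, fn):
        # Transformation: builds a new lazy dataset on top of this one.
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        # Only marks the dataset; nothing is computed until an action runs.
        self._cache_requested = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        data = self._compute()
        if self._cache_requested:
            self._cached = data
        return data

    # Actions: each one forces the lineage to be evaluated.
    def count(self):
        return len(self._materialize())

    def sum(self):
        return sum(self._materialize())


base = LazyDataset(lambda: list(range(5)))

uncached = base.map(lambda x: x * x)
uncached.count()
uncached.sum()

cached = base.map(lambda x: x * x).cache()
cached.count()
cached.sum()

print(uncached.evaluations)  # lineage runs per action
print(cached.evaluations)    # lineage runs once, then the stored result is reused
```

In this toy model the uncached dataset runs its lineage once per action, while the cached one runs it only for the first action.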

Let's say Spark decided to lazily cache all RDDs and used LRU to keep the most relevant RDDs in memory automatically (which is how most caching works anyway). That would be a great help to the developer, who would not have to think about caching and could concentrate on the application. Also, I do not see how it could negatively impact performance: since it is difficult to keep track of how many times a variable (RDD) is used inside the program, most programmers will decide to cache most of the RDDs anyway.

Caching usually happens automatically. Take an OS/platform, a framework, or a tool as an example. But given the complexities of caching in distributed computing, I might be missing why caching cannot be automatic, or what the performance implications are.

So I fail to understand why I have to cache explicitly, as:

  1. It looks ugly
  2. It can easily be missed
  3. It can easily be over- or under-used.

Solution

A subjective list of reasons:

  • in practice caching is rarely needed. It is useful mostly for iterative algorithms and for breaking long lineages. A typical ETL pipeline, for example, may not require caching at all. Caching most of the RDDs is definitely not the right choice.
  • there is no universal caching strategy. The actual choice depends on available resources, such as the amount of memory, disks (local, remote, storage service), and file system (in-memory, on-disk), and on the particular application.
  • on-disk persistence is expensive, while in-memory persistence puts more stress on the JVM and uses the most valuable resource in Spark.
  • it is impossible to cache automatically without making assumptions about application semantics. In particular:

    • the expected behavior when the data source changes. There is no universal answer, and in many situations it can be impossible to track changes automatically.
    • differentiating between deterministic and non-deterministic transformations, and choosing between caching and recomputing.
  • comparing Spark caching to OS-level caching doesn't make sense. The main goal of OS caching is to reduce latency. In Spark, latency is usually not the most important factor, and caching is used for other purposes such as consistency, correctness, and reducing stress on different parts of the system.
  • if the cache doesn't use off-heap storage, then caching puts additional pressure on the garbage collector. The GC cost can actually be higher than the cost of recomputing the data.
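The point about application semantics can be illustrated with a small sketch in plain Python. This is not Spark code: `run_two_actions`, `sample`, `read_source`, and the seeded generator are hypothetical names used only to show that reusing a stored result and recomputing it can give different answers, so a system cannot pick one silently on the user's behalf.

```python
import random


def run_two_actions(compute, cache):
    """Evaluate the same lineage for two 'actions'.

    With cache=True the second action reuses the first result;
    with cache=False the lineage is evaluated again.
    """
    first = compute()
    second = first if cache else compute()
    return first, second


# Case 1: a non-deterministic transformation (for example, random sampling).
rng = random.Random(0)  # seeded only so that the sketch is reproducible


def sample():
    return [rng.random() for _ in range(3)]


recomputed_a, recomputed_b = run_two_actions(sample, cache=False)
cached_a, cached_b = run_two_actions(sample, cache=True)

print(recomputed_a == recomputed_b)  # the two actions saw different samples
print(cached_a == cached_b)          # the two actions saw the same sample

# Case 2: the data source changes between two actions.
source = [1, 2, 3]


def read_source():
    return list(source)  # re-reads the "external" data on every evaluation


snapshot = read_source()   # what a cache would hold
source.append(4)           # the source changes afterwards
fresh = read_source()      # what recomputation would return

print(snapshot)
print(fresh)
```

Neither behavior is wrong in general. Whether the application wants the consistent snapshot or the fresh recomputation is exactly the kind of assumption an automatic cache would have to make.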

It is also worth noting that:

  • removing cached data is handled automatically using LRU
  • some data (like intermediate shuffle data) is persisted automatically. I acknowledge that it makes some of the previous arguments at least partially invalid.
  • Spark caching doesn't affect system level or JVM level mechanisms
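As a rough illustration of the LRU eviction mentioned in the list above, here is a minimal sketch in plain Python. It is a toy model under simplifying assumptions: the `LruStore` class is hypothetical, and capacity is counted in entries rather than bytes, whereas Spark's real memory management is considerably more involved.

```python
from collections import OrderedDict


class LruStore:
    """Toy block store that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity       # maximum number of entries (an assumption)
        self._blocks = OrderedDict()   # oldest access first, newest last

    def get(self, key):
        if key not in self._blocks:
            return None                # a miss would mean recomputing from lineage
        self._blocks.move_to_end(key)  # mark as most recently used
        return self._blocks[key]

    def put(self, key, value):
        self._blocks[key] = value
        self._blocks.move_to_end(key)
        while len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # drop the least recently used entry

    def keys(self):
        return list(self._blocks)


store = LruStore(capacity=2)
store.put("rdd_1", [1, 2])
store.put("rdd_2", [3, 4])
store.get("rdd_1")            # touch rdd_1, so rdd_2 becomes the eviction candidate
store.put("rdd_3", [5, 6])    # store is full, rdd_2 is evicted

print(store.keys())
```

Eviction decides only what to drop once memory is full. It does not decide what is worth storing in the first place, which is the part left to the developer.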
