How to define a global read\write variable in Spark

Question

Spark has broadcast variables, which are read-only, and accumulator variables, which can be updated by the nodes but not read. Is there a way - or a workaround - to define a variable which is both updatable and readable?

One requirement for such a read\write global variable would be to implement a cache. As files are loaded and processed as RDDs, calculations are performed. The results of these calculations - happening on several nodes running in parallel - need to be placed into a map which has as its key some of the attributes of the entity being processed. As subsequent entities within the RDDs are processed, the cache is queried.
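As a concrete sketch of that requirement (all names here - `Entity`, `EntityCache`, the attribute choice - are hypothetical illustrations, not from the question): a thread-safe map keyed by some of the entity's attributes, where each result is computed once and reused for later entities with the same key.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical entity; the cache key is derived from a subset of its attributes.
case class Entity(category: String, region: String, payload: String)

object EntityCache {
  // Thread-safe map keyed by (category, region).
  private val cache = new ConcurrentHashMap[(String, String), Int]()

  // Compute once per key; later lookups for the same key reuse the cached value.
  def getOrCompute(e: Entity)(compute: Entity => Int): Int =
    cache.computeIfAbsent((e.category, e.region), _ => compute(e))
}

object EntityCacheDemo extends App {
  var computations = 0
  def expensive(e: Entity): Int = { computations += 1; e.payload.length }

  val a = EntityCache.getOrCompute(Entity("books", "eu", "hello"))(expensive)
  val b = EntityCache.getOrCompute(Entity("books", "eu", "ignored"))(expensive)
  println(s"$a $b $computations") // prints: 5 5 1 - the second call is a cache hit
}
```

This works within a single JVM; the question is precisely how (or whether) such a structure can be shared across Spark nodes.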

Scala does have ScalaCache, which is a facade for cache implementations such as Google Guava. But how would such a cache be included and accessed within a Spark application?

The cache could be defined as a variable in the driver application which creates the SparkContext. But then there would be two issues:


  • Performance would presumably be bad because of the network overhead
    between the nodes and the driver application.

  • To my understanding, each RDD will be passed a copy of the variable
    (in this case, the cache) when the variable is first accessed by the
    function passed to the RDD. Each RDD would have its own copy rather than access to a shared global variable.

What is the best way to implement and store such a cache?

Thanks

Answer

Well, the best way of doing this is not doing it at all. In general, the Spark processing model doesn't provide any guarantees regarding


  • where,

  • when,

  • in what order (excluding of course the order of transformations defined by the lineage / DAG),

  • how many times

a given piece of code is executed. Moreover, any updates that depend directly on the Spark architecture are not granular.

These are the properties that make Spark scalable and resilient, but at the same time they are what makes keeping shared mutable state very hard to implement and, most of the time, completely useless.

If all you want is a simple cache then you have multiple options:


  • use one of the methods described by Tzach Zohar in Caching in Spark (stackoverflow.com/q/36305422/1560062)
  • use local caching (per JVM or executor thread) combined with application specific partitioning to keep things local
  • for communication with external systems use node local cache independent of Spark (for example Nginx proxy for http requests)
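The second option above can be sketched as follows (a minimal illustration under assumed names - `LocalCache`, `keyOf`, `process`, and the Spark usage shown in the comment are hypothetical, not from the answer). The cache lives in a singleton `object`, so each executor JVM lazily gets its own independent instance when the closure first runs there; combining this with key-based partitioning means all entities sharing a key land on the same executor, so lookups stay local:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// One instance per JVM: the driver when running locally, and one per
// executor JVM when this object is referenced inside a Spark closure.
object LocalCache {
  private val map = new ConcurrentHashMap[String, String]()
  val misses = new AtomicInteger(0)

  // `compute` is by-name, so it is only evaluated on a cache miss.
  def getOrCompute(key: String)(compute: => String): String =
    map.computeIfAbsent(key, _ => { misses.incrementAndGet(); compute })
}

// Inside a Spark job this would be used roughly like (hypothetical sketch):
//   rdd.map(e => (keyOf(e), e))
//      .partitionBy(new HashPartitioner(n)) // same key -> same executor
//      .map { case (k, e) => LocalCache.getOrCompute(k)(process(e)) }

object LocalCacheDemo extends App {
  val r1 = LocalCache.getOrCompute("k1")("computed-1")
  val r2 = LocalCache.getOrCompute("k1")("never-evaluated") // cache hit
  println(s"$r1 $r2 misses=${LocalCache.misses.get}") // prints: computed-1 computed-1 misses=1
}
```

Note that each executor's cache is independent and is lost if the executor is restarted, which is exactly why this only works when the partitioning keeps related entities together and the cached values are recomputable.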

If the application requires much more complex communication, you may try different message-passing tools to keep the state synchronized, but in general this requires complex and potentially fragile code.

