在火花中,广播如何工作? [英] In spark, how does broadcast work?

查看:70
本文介绍了在火花中,广播如何工作?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是一个非常简单的问题:在spark中,broadcast可用于有效地将变量发送给执行者.这是怎么工作的?

This is a very simple question: in spark, broadcast can be used to send variables to executors efficiently. How does this work ?

更准确地说:

  • 何时发送值:何时我调用broadcast或何时使用这些值?
  • 将数据发送到哪里:发给所有执行者,或只发给需要执行者的人?
  • 数据存储在哪里?在内存中还是在磁盘上?
  • 访问简单变量和广播变量的方式是否有所不同?当我调用.value方法时,幕后将发生什么?
  • when are values sent : as soon as I call broadcast, or when the values are used ?
  • Where exactly is the data sent : to all executors, or only to the ones that will need it ?
  • where is the data stored ? In memory, or on disk ?
  • Is there a difference in how simple variables and broadcast variables are accessed ? What happens under the hood when I call the .value method ?

推荐答案

简短答案

  • 在执行程序中首次需要它们时发送值.调用sc.broadcast(variable)时不发送任何消息.
  • 数据仅发送到包含需要执行者的节点.
  • 数据存储在内存中.如果没有足够的可用内存,则使用磁盘.
  • 是的,访问局部变量和广播变量之间有很大的区别.广播变量必须在首次访问时下载.
  • Short answer

    • Values are sent the first time they are needed in an executor. Nothing is sent when sc.broadcast(variable) is called.
    • The data is sent only to the nodes that contain an executor that needs it.
    • The data is stored in memory. If not enough memory is available, the disk is used.
    • Yes, there is a big difference between accessing a local variable and a broadcast variable. Broadcast variables have to be downloaded the first time they are accessed.
    • 答案在Spark的源代码中,位于

      The answer is in Spark's source, in TorrentBroadcast.scala.

      1. 调用sc.broadcast时,将从BroadcastFactory.scala实例化一个新的TorrentBroadcast对象.以下情况发生在

      1. When sc.broadcast is called, a new TorrentBroadcast object is instantiated from BroadcastFactory.scala. The following happens in writeBlocks(), which is called when the TorrentBroadcast object is initialized:

      1. 使用 MEMORY_AND_DISK 政策.
      2. 已序列化.
      3. 序列化的版本分为4Mb块,压缩后的 [0] ,并在本地保存 [1] .
      1. The object is cached unserialized locally using the MEMORY_AND_DISK policy.
      2. It is serialized.
      3. The serialized version is split into 4Mb blocks, that are compressed[0], and saved locally[1].

    • 创建新的执行程序时,它们仅具有轻量级的TorrentBroadcast对象,该对象仅包含广播对象的标识符及其块数.

    • When new executors are created, they only have the lightweight TorrentBroadcast object, that only contains the broadcast object's identifier, and its number of blocks.

      TorrentBroadcast对象具有包含其值的lazy [2] 属性.调用value方法时,将返回此惰性属性.因此,第一次在任务上调用此value函数时,会发生以下情况:

      The TorrentBroadcast object has a lazy[2] property that contains its value. When the value method is called, this lazy property is returned. So the first time this value function is called on a task, the following happens:

      1. 以随机顺序,从本地块管理器中获取块并进行未压缩.
      2. 如果它们不在本地块管理器中,请 MEMORY_AND_DISK_SER .
      1. In a random order, blocks are fetched from the local block manager and uncompressed.
      2. If they are not present in the local block manager, getRemoteBytes is called on the block manager to fetch them. Network traffic happens only at that time.
      3. If the block wasn't present locally, it is cached using MEMORY_AND_DISK_SER.


    • [0] 默认情况下使用lz4压缩. 可以调整.


      [0] Compressed with lz4 by default. This can be tuned.

      [1] 块存储在本地块管理器,使用 ,这意味着它将内存中不适合的分区溢出到磁盘上.每个块都有一个唯一的标识符,该标识符是根据广播变量的标识符及其偏移量计算得出的.可以配置块的大小 ;默认为4Mb.

      [1] The blocks are stored in the local block manager, using MEMORY_AND_DISK_SER, which means that it spills partitions that don't fit in memory to disk. Each block has an unique identifier, computed from the identifier of the broadcast variable, and its offset. The size of blocks can be configured; it is 4Mb by default.

      [2] scala中的 lazy val是一个变量,其值在首次访问时被评估,然后被缓存. 请参阅文档

      [2] A lazy val in scala is a variable whose value is evaluated the first time it is accessed, and then cached. See the documentation.

      这篇关于在火花中,广播如何工作?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆