Does Spark optimize the network traffic for broadcast variables?

Question

Knowing that Spark uses multiple executors per worker node and that each executor runs in its own JVM, I wonder whether and how Spark optimizes the network traffic for broadcast variables. Hopefully it performs a single download for each worker node and then passes the already serialized data to the executors on that node. The alternative would be to download the broadcast data every time an executor needs it, which means downloading the same data multiple times on a given node.

Recommended answer

Yes, Spark does optimize broadcasting, by using torrent broadcasts. To quote the source (the class comment of TorrentBroadcast):

* A BitTorrent-like implementation of [[org.apache.spark.broadcast.Broadcast]].
*
* The mechanism is as follows:
*
* The driver divides the serialized object into small chunks and
* stores those chunks in the BlockManager of the driver.
*
* On each executor, the executor first attempts to fetch the object from its BlockManager. If
* it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or
* other executors if available. Once it gets the chunks, it puts the chunks in its own
* BlockManager, ready for other executors to fetch from.
*
* This prevents the driver from being the bottleneck in sending out multiple copies of the
* broadcast data (one per executor).
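
The mechanism quoted above can be sketched as a small toy model. This is not Spark code: `BlockStore`, `split_into_chunks`, `fetch_broadcast` and `simulate` are hypothetical names invented for illustration, the chunk size is artificially tiny, and executors fetch one after another here, whereas real executors fetch concurrently. It only aims to show why the driver ends up serving fewer chunks once other executors hold copies.

```python
import pickle
import random

CHUNK_SIZE = 4  # bytes; tiny on purpose, real broadcast blocks are far larger


class BlockStore:
    """Toy stand-in for a BlockManager (hypothetical): chunk index -> bytes."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}
        self.served = 0  # how many chunks this store has sent to others


def split_into_chunks(value, chunk_size=CHUNK_SIZE):
    """Serialize the value and cut the bytes into fixed-size chunks."""
    data = pickle.dumps(value)
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def fetch_broadcast(executor, driver, peers, num_chunks, rng):
    """Rebuild the broadcast value on one executor.

    A chunk already in the local store is reused; otherwise it is fetched
    from a randomly chosen store (driver or another executor) that holds it.
    """
    for idx in rng.sample(range(num_chunks), num_chunks):  # random chunk order
        if idx in executor.chunks:
            continue
        holders = [s for s in [driver] + peers
                   if s is not executor and idx in s.chunks]
        source = rng.choice(holders)  # never empty: the driver holds every chunk
        executor.chunks[idx] = source.chunks[idx]
        source.served += 1
    data = b"".join(executor.chunks[i] for i in range(num_chunks))
    return pickle.loads(data)


def simulate(value, num_executors=4, seed=0):
    """Broadcast `value` from a driver store to several executor stores."""
    rng = random.Random(seed)
    driver = BlockStore("driver")
    driver.chunks = dict(enumerate(split_into_chunks(value)))
    executors = [BlockStore(f"executor-{i}") for i in range(num_executors)]
    results = [
        fetch_broadcast(e, driver, executors, len(driver.chunks), rng)
        for e in executors
    ]
    return driver, executors, results


if __name__ == "__main__":
    driver, executors, results = simulate({"a": 1, "b": [1, 2, 3]})
    print("chunks:", len(driver.chunks))
    print("served by driver:", driver.served)
    for e in executors:
        print(f"served by {e.name}:", e.served)
```

In this model, as in the quoted description, the unit of caching is the executor rather than the worker node: every executor ends up with its own copy, but the chunks can come from any executor that already has them, so the driver does not have to send one full copy per executor.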

In the past there was another broadcast implementation (HTTP broadcast), but it was removed completely in Spark 2.0.
