Spark - how does it distribute data around the nodes?


Question

How does Spark distribute data to workers?

Do the workers read from the data source, or does the driver read the data and send it to the workers? And when a worker needs data that lives on another worker, do the workers communicate directly?

Thanks!

Answer

If you use distributed input methods like SparkContext.textFile, then the workers read directly from your data source (and if you explicitly open HDFS files from inside worker task code, those reads will of course also happen on the workers).
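As a rough illustration, here is a minimal spark-shell sketch of that first path (the HDFS path is a placeholder, and `sc` is the SparkContext that spark-shell provides):

```scala
// `sc` is the SparkContext that spark-shell creates automatically.
// textFile only records the path and split metadata on the driver;
// the file's bytes are read by the executors, one partition per task.
val lines = sc.textFile("hdfs:///data/input.txt") // placeholder path

// count() runs as tasks on the workers; each task reads its own split
// directly from HDFS, so the contents never pass through the driver.
println(lines.count())
```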

If you manually read data in your main driver program and then use SparkContext.parallelize, then indeed your driver will send that data to the workers.
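A minimal sketch of that second path, again assuming the spark-shell `sc`:

```scala
// This collection lives entirely in the driver's memory.
val localData = 1 to 100000

// parallelize slices the driver-side collection into partitions and
// ships each slice to the executors: here the data flows driver -> workers.
val rdd = sc.parallelize(localData, 8) // 8 partitions

println(rdd.sum())
```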

Data dependencies from worker to worker are generally referred to as the shuffle; this type of worker-to-worker communication is in many ways the heart of most big data processing systems, precisely because it is tricky to do efficiently and reliably. Conceptually you can treat it more or less as "communicating directly", but there may be a lot more going on under the hood depending on how the data dependency is handled.
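For example, a key-based aggregation such as reduceByKey forces a shuffle; a small spark-shell sketch with toy data (and the assumed `sc`):

```scala
// A toy key-value dataset spread across 4 partitions.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 4)

// reduceByKey needs all values for a key on one worker, so Spark shuffles:
// map-side tasks write output partitioned by key, and reduce-side tasks
// fetch their blocks from the other workers over the network.
val sums = pairs.reduceByKey(_ + _)

sums.collect().foreach(println) // e.g. (a,4), (b,6)
```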
