Are Jupyter notebook executors distributed dynamically in Apache Spark?

Question

I got a question in order to better understand a big data concept within Apache Hadoop Spark. Not sure if it's off-topic in this forum, but let me know.

Imagine an Apache Hadoop cluster with 8 servers managed by the YARN resource manager. I uploaded a file into HDFS (the file system) that is configured with a 64 MB block size and a replication count of 3. That file is then split into blocks of 64 MB. Now let's imagine the blocks are distributed by HDFS onto nodes 1, 2 and 3.

But now I'm coding some Python code with a Jupyter notebook. Therefore the notebook is started with this command:

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master yarn-client --num-executors 3 --executor-cores 4 --executor-memory 16G

Within the notebook I'm loading the file from HDFS to do some analytics. When I execute my code, I can see in the YARN web UI that I get 3 executors and how the jobs are submitted (distributed) to the executors.
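
For illustration, here is a minimal sketch of the kind of notebook code meant here. It is not the asker's actual code: the HDFS path is hypothetical, and sc is the SparkContext that pyspark creates for the notebook kernel.

# Minimal sketch only; the HDFS path below is hypothetical.
# `sc` is the SparkContext created by the pyspark launcher for this kernel.
lines = sc.textFile("hdfs:///user/me/data.txt")

# A simple word count stands in for "some analytics".
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))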

The interesting part is that my executors are fixed to specific compute nodes right after the start command (see above), for instance nodes 6, 7 and 8.

My questions are:

  1. Is my assumption correct, that the executor nodes are fixed to computing nodes and the HDFS blocks will be transferred to the executors once I'm accessing (loading) the file from HDFS?
  2. Or, are the executors dynamically assigned and started at the nodes where the data is (nodes 1, 2 and 3)? In this case my observation in the YARN web UI must be wrong.

I'm really interested in understanding this better.

Answer

Are Jupyter notebook executors distributed dynamically in Apache Spark

For the sake of clarity, let's distinguish:

  • Jupyter notebooks and their associated kernels - a kernel is the Python process behind a notebook's UI. A kernel executes whatever code you type and submit in your notebook. Kernels are managed by Jupyter, not by Spark.

  • Spark executors - these are the compute resources allocated on the YARN cluster to execute Spark jobs.

  • HDFS data nodes - these are where your data resides. Data nodes may or may not be the same as executor nodes.

Is my assumption correct, that the executor nodes are fixed to computing nodes and the HDFS blocks will be transferred to the executors once I'm accessing (loading) the file from HDFS

Yes and no - yes, Spark takes data locality into account when scheduling jobs. No, there is no guarantee. As per the Spark documentation:

(...) there are two options: a) wait until a busy CPU frees up to start a task on data on the same server, or b) immediately start a new task in a farther away place that requires moving data there. What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU. (...)
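
The timeout mentioned in the quote is controlled by spark.locality.wait (default 3s). As a sketch, not part of the original answer, it could be raised when the SparkConf is built; the 10s value below is purely illustrative:

from pyspark import SparkConf, SparkContext

# Sketch only: wait longer for a data-local executor slot before
# falling back to running the task elsewhere (illustrative value).
conf = (SparkConf()
        .setAppName("locality-demo")
        .set("spark.locality.wait", "10s"))

sc = SparkContext(conf=conf)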

Or, are the executors dynamically assigned and started at the nodes where the data is (nodes 1, 2 and 3)?

This depends on the configuration. In general, executors are allocated to a Spark application (i.e. a SparkContext) dynamically, and deallocated when no longer used. However, executors are kept alive for some time, as per the job scheduling documentation:

(...) a Spark application removes an executor when it has been idle for more than spark.dynamicAllocation.executorIdleTimeout seconds. (...)
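
As a sketch of the configuration that quote refers to, not part of the original answer, dynamic allocation could be enabled like this; the values are illustrative, and on YARN the external shuffle service is also required:

from pyspark import SparkConf, SparkContext

# Illustrative dynamic-allocation settings; values are examples, not recommendations.
conf = (SparkConf()
        .setAppName("dynamic-allocation-demo")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")   # needed for dynamic allocation on YARN
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "8")
        # executors idle longer than this are released, per the quote above
        .set("spark.dynamicAllocation.executorIdleTimeout", "60s"))

sc = SparkContext(conf=conf)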

To get more control over what runs where, you may use Scheduler Pools.
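
A minimal sketch of selecting a pool from the notebook, assuming FAIR scheduling is enabled (spark.scheduler.mode=FAIR) and a pool named "analytics" is defined in the allocation file; both names are hypothetical:

# Hypothetical pool name; requires spark.scheduler.mode=FAIR and a matching
# entry in the fair-scheduler allocation file.
sc.setLocalProperty("spark.scheduler.pool", "analytics")

# Jobs triggered from this thread now run in that pool.
print(sc.parallelize(range(1000)).sum())

# Clear the property to fall back to the default pool.
sc.setLocalProperty("spark.scheduler.pool", None)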
