What is the cluster manager used in Databricks? How do I change the number of executors in Databricks clusters?

Question

What is the cluster manager used in Databricks? How do I change the number of executors in Databricks clusters?

Solution

What is the cluster manager used in Databricks?

Azure Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that includes:

  • Fully managed Spark clusters
  • An interactive workspace for exploration and visualization
  • A platform for powering your favorite Spark-based applications

The Databricks Runtime is built on top of Apache Spark and is natively built for the Azure cloud.

With the Serverless option, Azure Databricks completely abstracts out the infrastructure complexity and the need for specialized expertise to set up and configure your data infrastructure. The Serverless option helps data scientists iterate quickly as a team.

For data engineers who care about the performance of production jobs, Azure Databricks provides a Spark engine that is faster and more performant thanks to various optimizations at the I/O layer and the processing layer (Databricks I/O).

How do I change the number of executors in Databricks clusters?

When you create a cluster, you can either provide a fixed number of workers for the cluster or provide a minimum and maximum number of workers for the cluster.

When you provide a fixed-size cluster, Azure Databricks ensures that your cluster has the specified number of workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling.

With autoscaling, Azure Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds workers during these phases of your job (and removes them when they’re no longer needed).
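To make the two options concrete, here is a minimal sketch of the two cluster specifications as they would appear in a Databricks Clusters API create request. The field names `num_workers` and `autoscale` (with `min_workers` and `max_workers`) come from that API; the cluster names, the runtime label, and the VM type below are placeholder assumptions and should be replaced with values valid in your workspace. Databricks runs one executor per worker node, so the worker count is the setting that controls the number of executors.

```python
import json

# Placeholder values, assumed for illustration only.
ASSUMED_SPARK_VERSION = "13.3.x-scala2.12"
ASSUMED_NODE_TYPE = "Standard_DS3_v2"


def fixed_size_spec(name, num_workers):
    """Build a cluster spec with a fixed number of workers."""
    return {
        "cluster_name": name,
        "spark_version": ASSUMED_SPARK_VERSION,
        "node_type_id": ASSUMED_NODE_TYPE,
        "num_workers": num_workers,
    }


def autoscaling_spec(name, min_workers, max_workers):
    """Build a cluster spec with a worker range (autoscaling)."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        "spark_version": ASSUMED_SPARK_VERSION,
        "node_type_id": ASSUMED_NODE_TYPE,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


if __name__ == "__main__":
    print(json.dumps(fixed_size_spec("etl-fixed", 4), indent=2))
    print(json.dumps(autoscaling_spec("etl-autoscale", 2, 8), indent=2))
```

A spec carries either `num_workers` or `autoscale`, not both, which mirrors the fixed-versus-range choice in the cluster creation UI.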

Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the cluster to match a workload. This applies especially to workloads whose requirements change over time (like exploring a dataset during the course of a day), but it can also apply to a one-time shorter workload whose provisioning requirements are unknown. Autoscaling thus offers two advantages:

  • Workloads can run faster compared to a constant-sized under-provisioned cluster.
  • Autoscaling clusters can reduce overall costs compared to a statically-sized cluster.
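The cost argument can be illustrated with a toy calculation. The sketch below compares worker-hours for a cluster sized for peak load against an idealized autoscaling cluster that follows demand within its bounds. The hourly demand figures are invented for illustration, and the model assumes instant scaling; it does not reproduce the actual Databricks autoscaling algorithm.

```python
def autoscaled_allocation(demand, min_workers, max_workers):
    """Clamp each hour's demand to the configured worker range."""
    return [max(min_workers, min(max_workers, d)) for d in demand]


def worker_hours(allocation):
    """Total worker-hours consumed over the period."""
    return sum(allocation)


# Hypothetical number of workers needed in each hour of a job.
demand = [2, 2, 8, 8, 3, 2]

# A fixed cluster sized for the peak keeps 8 workers the whole time.
fixed = [8] * len(demand)
auto = autoscaled_allocation(demand, min_workers=2, max_workers=8)

if __name__ == "__main__":
    print("fixed worker-hours:", worker_hours(fixed))
    print("autoscaled worker-hours:", worker_hours(auto))
```

A fixed cluster sized below the peak would use fewer worker-hours than the peak-sized one but would be under-provisioned during the demanding hours, which is the "run faster" half of the comparison.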

Note: Depending on the constant size of the cluster and the workload, autoscaling gives you one or both of these benefits at the same time. The cluster size can go below the minimum number of workers selected when the cloud provider terminates instances. In this case, Azure Databricks continuously retries to re-provision instances in order to maintain the minimum number of workers.

Cluster autoscaling is not available for spark-submit jobs. To learn more about autoscaling, see Cluster autoscaling.
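For a cluster that already exists, the number of workers can also be changed after creation, either by editing the cluster in the UI or through the Clusters API resize endpoint. The following sketch only builds the HTTP request and does not send it. The workspace URL, token, and cluster ID are placeholders, and the `/api/2.0/clusters/resize` path should be checked against the API version your workspace exposes.

```python
import json
import urllib.request


def build_resize_request(host, token, cluster_id, num_workers):
    """Build (but do not send) a request that sets a new fixed worker count."""
    body = json.dumps(
        {"cluster_id": cluster_id, "num_workers": num_workers}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/api/2.0/clusters/resize",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# All three identifiers below are placeholders, not real values.
req = build_resize_request(
    host="https://adb-0000000000000000.0.azuredatabricks.net",
    token="dapi-EXAMPLE-TOKEN",
    cluster_id="0000-000000-example0",
    num_workers=6,
)

if __name__ == "__main__":
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

To switch an existing cluster to a worker range instead, the request body would carry an `autoscale` object with `min_workers` and `max_workers` in place of `num_workers`.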

Hope this helps.
