Logging Attached Cluster Information in Databricks / Spark


Problem Description


I would like to do some performance testing on Databricks. To do this I would like to log which cluster (VM type, e.g. Standard_DS3_v2) I was using during the test (we can assume that the driver and worker nodes are the same). I know I could log the number of workers, the number of cores (on the driver at least), and the memory (on the driver at least). However, I would like to know the VM type, since I want to be able to identify whether I used, e.g., a storage optimized or general purpose cluster. Information equivalent to the VM type would also be fine. Ideally, I can get this information as a string in a variable within the notebook, to later write it into my log file from there together with the other information I am logging. However, I am also happy with any hacky workaround if there is no straightforward solution to this.

Recommended Answer


You can get this information from the REST API, via a GET request to the Clusters API. You can use the notebook context to identify the cluster the notebook is running on: the dbutils.notebook.getContext call returns a map of attributes including the cluster ID and the workspace domain name, and you can extract an authentication token from it. Here is code that prints the driver and worker node types (it's in Python, but the Scala equivalent is very similar; in Scala, I often use dbutils.notebook.getContext.tags to find what tags are available):

import requests

# Pull the notebook context, which carries workspace/cluster metadata
# and a short-lived API token for the current user.
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()  # workspace domain name
host_token = ctx.apiToken().get()                    # token for authenticating the REST call
cluster_id = ctx.tags().get("clusterId").get()       # ID of the attached cluster

# Ask the Clusters API for the full definition of this cluster.
response = requests.get(
    f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
    headers={'Authorization': f'Bearer {host_token}'}
).json()
print(f"driver type={response['driver_node_type_id']} worker type={response['node_type_id']}")
