Running python package .egg in Azure Databricks Job


Problem Description

Using the build tool (setuptools), I packaged my Python code in .egg format. I want to run this package as a job in Azure Databricks.

I am able to execute the package on my local machine with the following command:

spark-submit --py-files ./dist/hello-1.0-py3.6.egg hello/pi.py
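
For context, a minimal setuptools layout that could produce an egg like the one above looks roughly as follows; the project metadata here is an assumption inferred from the command, not taken from the question:

# setup.py - minimal sketch; the metadata below is assumed, not from the question.
from setuptools import setup, find_packages

setup(
    name="hello",
    version="1.0",
    packages=find_packages(),  # picks up the hello/ package containing pi.py
)

# Build the egg with:
#   python setup.py bdist_egg
# which writes dist/hello-1.0-py3.6.egg (the pyX.Y suffix depends on the interpreter).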

1) Copied the package into a DBFS path as follows:

Workspace -> Users -> Create -> Library -> Library Source (DBFS) -> Library Type (Python Egg) -> Upload

2) Created a job with a spark-submit task running on a new cluster.

3) Configured the following parameters for the task:

["--py-files","dbfs:/FileStore/jars/8c1231610de06d96-hello_1_0_py3_6-70b16.egg","hello/pi.py"]

Actual: /databricks/python/bin/python: can't open file '/databricks/driver/hello/hello.py': [Errno 2] No such file or directory

Expected: The job should execute successfully.

Recommended Answer

The only way I've got this to work is by using the API to create a Python job. The UI does not support this for some reason.

I use PowerShell to work with the API - this is an example that creates a job using an egg, which works for me:

$Lib = '{"egg":"LOCATION"}'.Replace("LOCATION", "dbfs:$TargetDBFSFolderCode/pipelines.egg")
$ClusterId = "my-cluster-id"
$j = "sample"
$PythonParameters = "pipelines.jobs.cleansed.$j"
$MainScript = "dbfs:" + $TargetDBFSFolderCode + "/main.py"
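# Upload the local build output (main.py and the egg) to DBFS, then create the Python job that references them.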
Add-DatabricksDBFSFile -BearerToken $BearerToken -Region $Region -LocalRootFolder "./bin/tmp" -FilePattern "*.*"  -TargetLocation $TargetDBFSFolderCode -Verbose
Add-DatabricksPythonJob -BearerToken $BearerToken -Region $Region -JobName "$j-$Environment" -ClusterId $ClusterId `
    -PythonPath $MainScript -PythonParameters $PythonParameters -Libraries $Lib -Verbose

That copies my main.py and pipelines.egg to DBFS, then creates a job pointed at them, passing in a parameter.
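
If you prefer to call the Databricks REST API directly rather than going through the PowerShell module, the same thing can be done with a POST to /api/2.0/jobs/create. The sketch below is only an illustration of that call - the host, token, cluster ID, and DBFS paths are placeholders, not values from the answer above:

import requests

HOST = "https://<your-workspace>.azuredatabricks.net"   # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

job_spec = {
    "name": "hello-egg-job",
    "existing_cluster_id": "<cluster-id>",
    # Attach the egg as a job library so its modules can be imported.
    "libraries": [{"egg": "dbfs:/FileStore/jars/hello-1.0-py3.6.egg"}],
    # spark_python_task points at a driver script; the egg itself is not the entry point.
    "spark_python_task": {
        "python_file": "dbfs:/FileStore/jars/main.py",
        "parameters": ["pipelines.jobs.cleansed.sample"],
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])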

One annoying thing about eggs on Databricks: you must uninstall the library and restart the cluster before it picks up any new version that you deploy.
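
If you want to script that uninstall-and-restart step, the Libraries and Clusters REST APIs can do it; again, this is a sketch with placeholder values rather than part of the original answer:

import requests

HOST = "https://<your-workspace>.azuredatabricks.net"   # placeholder
TOKEN = "<personal-access-token>"                        # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}
CLUSTER_ID = "<cluster-id>"                              # placeholder

# Mark the old egg for removal from the cluster...
requests.post(
    f"{HOST}/api/2.0/libraries/uninstall",
    headers=HEADERS,
    json={
        "cluster_id": CLUSTER_ID,
        "libraries": [{"egg": "dbfs:/FileStore/jars/hello-1.0-py3.6.egg"}],
    },
).raise_for_status()

# ...then restart the cluster; the uninstall only takes effect after a restart,
# after which the new egg version can be installed.
requests.post(
    f"{HOST}/api/2.0/clusters/restart",
    headers=HEADERS,
    json={"cluster_id": CLUSTER_ID},
).raise_for_status()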

If you use an engineering cluster, this is not an issue.
