Submit a Python project to Dataproc job
Question
I have a python project whose folder has the structure
main_directory
├── lib
│   └── lib.py
└── run
    └── script.py
script.py is
from pyspark.sql import SparkSession

from lib.lib import add_two

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('script') \
    .getOrCreate()

print(add_two(1, 2))
and lib.py is
def add_two(x, y):
    return x + y
I want to launch it as a Dataproc job in GCP. I have checked online, but I have not understood well how to do it. I am trying to launch the script with
gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
run/script.py
But I receive the following error message:
from lib.lib import add_two
ModuleNotFoundError: No module named 'lib.lib'
Could you help me with how to launch the job on Dataproc? The only way I have found is to remove the absolute path, making this change to script.py:
from lib import add_two
and launching the job as
gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
--files /lib/lib.py \
/run/script.py
However, I would like to avoid the tedious process of listing the files manually every time.
Following the suggestion of @Igor to pack everything into a zip file, I have found that
zip -j --update -r libpack.zip /projectfolder/* && spark-submit --py-files libpack.zip /projectfolder/run/script.py
works. However, this puts all files in the same root folder inside libpack.zip, so if there were files with the same names in different subfolders this would not work.
Any suggestions?
Answer
Zip the dependencies -
cd base-path-to-python-modules
zip -qr deps.zip ./* -x script.py
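For the project layout in the question, a possible concrete version of this would be (the -x pattern is an assumption about where the main script sits relative to main_directory):

cd main_directory
# keep the lib/ directory structure inside the archive (unlike zip -j),
# and leave the main script out of the dependency zip
zip -qr deps.zip ./* -x run/script.py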
Copy deps.zip to HDFS/GCS and use that URI when submitting the job, as shown below.
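For example, with gsutil and gcloud (the bucket name and paths below are placeholders; --py-files adds the zip to the job's PYTHONPATH on the cluster):

gsutil cp deps.zip gs://your-bucket/deps.zip
gsutil cp run/script.py gs://your-bucket/script.py

gcloud dataproc jobs submit pyspark \
    --cluster=$CLUSTER_NAME --region=$REGION \
    --py-files=gs://your-bucket/deps.zip \
    gs://your-bucket/script.py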
Alternatively, submit the python project (pyspark) using Dataproc's Python client:
from google.cloud import dataproc_v1
from google.cloud.dataproc_v1.gapic.transports import (
    job_controller_grpc_transport)

region = <cluster region>
cluster_name = <your cluster name>
project_id = <gcp-project-id>

job_transport = (
    job_controller_grpc_transport.JobControllerGrpcTransport(
        address='{}-dataproc.googleapis.com:443'.format(region)))

dataproc_job_client = dataproc_v1.JobControllerClient(job_transport)

job_file = <gs://bucket/path/to/main.py or hdfs://file/path/to/main/job.py>

# command line args for the main job file
args = ['arg1', 'arg2']

# required only if the main python job file has imports from other modules;
# can be one of .py, .zip, or .egg.
additional_python_files = ['hdfs://path/to/deps.zip', 'gs://path/to/moredeps.zip']

job_details = {
    'placement': {
        'cluster_name': cluster_name
    },
    'pyspark_job': {
        'main_python_file_uri': job_file,
        'args': args,
        'python_file_uris': additional_python_files
    }
}

res = dataproc_job_client.submit_job(project_id=project_id,
                                     region=region,
                                     job=job_details)
job_id = res.reference.job_id
print(f'Submitted dataproc job id: {job_id}')
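If you also want to block until the job finishes, a minimal polling sketch could look like the following (an assumption based on the same pre-2.0 google-cloud-dataproc client used above, whose JobControllerClient exposes get_job and whose enums module defines JobStatus.State):

import time

from google.cloud.dataproc_v1.gapic import enums

# States in which the job is no longer running (assumed terminal values
# of the old GAPIC JobStatus.State enum).
TERMINAL_STATES = {
    enums.JobStatus.State.DONE,
    enums.JobStatus.State.ERROR,
    enums.JobStatus.State.CANCELLED,
}

while True:
    job = dataproc_job_client.get_job(project_id, region, job_id)
    if job.status.state in TERMINAL_STATES:
        print(f'Job {job_id} finished in state {job.status.state}')
        break
    time.sleep(10)  # poll every 10 seconds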