Submit a Python project to a Dataproc job


Problem description

I have a Python project whose folder has the following structure:

main_directory - lib - lib.py
               - run - script.py

script.py

from pyspark.sql import SparkSession

from lib.lib import add_two

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('script') \
    .getOrCreate()

print(add_two(1, 2))

lib.py

def add_two(x,y):
    return x+y

I want to launch it as a Dataproc job on GCP. I have searched online, but I have not found a clear explanation of how to do it. I am trying to launch the script with:

gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
  run/script.py

But I receive the following error message:

from lib.lib import add_two
ModuleNotFoundError: No module named 'lib.lib'

Could you help me with how to launch the job on Dataproc? The only way I have found so far is to remove the package path from the import, changing script.py to:

 from lib import add_two

and launch the job as:

gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
  --files /lib/lib.py \
  /run/script.py

However, I would like to avoid the tedious process of listing the files manually every time.

Following @Igor's suggestion to pack the dependencies into a zip file, I have found that

zip -j --update -r libpack.zip /projectfolder/* && spark-submit --py-files libpack.zip /projectfolder/run/script.py

works. However, this puts all the files in the same root folder inside libpack.zip, so it would not work if files with the same name existed in different subfolders.
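
For reference, a zip built without -j keeps the folder layout inside the archive, so the package-qualified import can still resolve. A minimal sketch, assuming the layout above and that lib contains an __init__.py so it is importable as a package:

cd main_directory
# -r recurses into lib/ and keeps the lib/ prefix inside the archive
zip -r libpack.zip lib
spark-submit --py-files libpack.zip run/script.py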

Any suggestions?

Recommended answer

Zip the dependencies -

cd base-path-to-python-modules
zip -qr deps.zip ./* -x script.py

Copy deps.zip to HDFS/GCS and use its URI when submitting the job, as shown below.
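
For example (a sketch; the bucket path is a placeholder), the archive can be copied with gsutil, and the same URI also works with the gcloud CLI via --py-files:

# copy the archive to a GCS bucket ('your-bucket' is a placeholder)
gsutil cp deps.zip gs://your-bucket/deps/deps.zip

# gcloud equivalent of the client-library submission below
gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
  --py-files gs://your-bucket/deps/deps.zip \
  run/script.py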

Submit a Python project (PySpark) using Dataproc's Python client library:

from google.cloud import dataproc_v1
from google.cloud.dataproc_v1.gapic.transports import (
    job_controller_grpc_transport)

region = <cluster region>
cluster_name = <your cluster name>
project_id = <gcp-project-id>

job_transport = (
    job_controller_grpc_transport.JobControllerGrpcTransport(
        address='{}-dataproc.googleapis.com:443'.format(region)))
dataproc_job_client = dataproc_v1.JobControllerClient(job_transport)

job_file = <gs://bucket/path/to/main.py or hdfs://file/path/to/main/job.py>

# command-line arguments passed to the main job file
args = ['arg1', 'arg2']

# required only if the main python job file imports from other modules;
# each entry can be a .py, .zip, or .egg file
additional_python_files = ['hdfs://path/to/deps.zip', 'gs://path/to/moredeps.zip']

job_details = {
    'placement': {
        'cluster_name': cluster_name
    },
    'pyspark_job': {
        'main_python_file_uri': job_file,
        'args': args,
        'python_file_uris': additional_python_files
    }
}

res = dataproc_job_client.submit_job(project_id=project_id,
                                     region=region, 
                                     job=job_details)
job_id = res.reference.job_id

print(f'Submitted dataproc job id: {job_id}')
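
To wait for the result, the returned job id can be polled with the same client. A minimal sketch, assuming the pre-2.0 google-cloud-dataproc client used above:

import time

# poll until the job reaches a terminal state (same client, project_id, region as above)
terminal_states = {'DONE', 'ERROR', 'CANCELLED'}
while True:
    job = dataproc_job_client.get_job(project_id=project_id,
                                      region=region,
                                      job_id=job_id)
    state = job.status.State.Name(job.status.state)  # enum value -> name
    print(f'Job {job_id} is {state}')
    if state in terminal_states:
        break
    time.sleep(10)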
