Sagemaker processing job with PySpark and Step Functions


Question

This is my problem: I have to run a Sagemaker processing job using custom code written in PySpark. I've used the Sagemaker SDK by running these commands:

import sagemaker

# role_arn, bucket_name and file_path are defined elsewhere in the script
spark_processor = sagemaker.spark.processing.PySparkProcessor(
    base_job_name="spark-preprocessor",
    framework_version="2.4",
    role=role_arn,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1800,
)

spark_processor.run(
    submit_app="processing.py",
    arguments=["s3_input_bucket", bucket_name,
               "s3_input_file_path", file_path],
)
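For context, processing.py itself is not shown in the question. Below is a minimal sketch of what such a PySpark entry point could look like, assuming the arguments are read positionally from sys.argv and the input is CSV (both are assumptions; only the argument names above come from the question):

# processing.py - illustrative sketch only
import sys
from pyspark.sql import SparkSession

def main():
    # Arguments arrive in the same order they were passed to spark_processor.run(),
    # i.e. ['s3_input_bucket', <bucket>, 's3_input_file_path', <path>]
    args = dict(zip(sys.argv[1::2], sys.argv[2::2]))
    bucket = args["s3_input_bucket"]
    file_path = args["s3_input_file_path"]

    spark = SparkSession.builder.appName("spark-preprocessor").getOrCreate()
    # Assumed CSV input; the real preprocessing logic would go here
    df = spark.read.csv(f"s3://{bucket}/{file_path}", header=True)
    df.write.mode("overwrite").parquet(f"s3://{bucket}/output/")
    spark.stop()

if __name__ == "__main__":
    main()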

Now I have to automate the workflow by using Step Functions. For this purpose, I've written a Lambda function to do that, but I receive the following error:

{
  "errorMessage": "Unable to import module 'lambda_function': No module named 'sagemaker'",
  "errorType": "Runtime.ImportModuleError"
}

This is my lambda function:

import os
import sagemaker

def lambda_handler(event, context):
    # role_arn was undefined in the original snippet; reading it from an
    # environment variable (the variable name here is illustrative) is one
    # way to supply the processing job's execution role
    role_arn = os.environ["SAGEMAKER_ROLE_ARN"]

    spark_processor = sagemaker.spark.processing.PySparkProcessor(
        base_job_name="spark-preprocessor",
        framework_version="2.4",
        role=role_arn,
        instance_count=2,
        instance_type="ml.m5.xlarge",
        max_runtime_in_seconds=1800,
    )

    spark_processor.run(
        submit_app="processing.py",
        arguments=["s3_input_bucket", event["bucket_name"],
                   "s3_input_file_path", event["file_path"]],
    )

My question is: how can I create a step in my state machine that runs PySpark code using Sagemaker processing?

Thanks

Answer

The sagemaker SDK is not installed by default in the Lambda container environment: you should include it in the Lambda zip that you upload to S3.

There are various ways to do this; one of the easiest is to deploy your Lambda with the Serverless Application Model (SAM) CLI. In that case it may be enough to put sagemaker in a requirements.txt in the folder that contains your Lambda code, and SAM will make sure the dependency ends up in the zip, as sketched below.
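For illustration only (the folder and file names are made up, and template.yaml is assumed to point its CodeUri at the folder holding the handler), the project could be laid out like this; sam build installs whatever requirements.txt lists and sam deploy uploads the resulting package:

my-app/
├── template.yaml          # CodeUri points at lambda_src/
└── lambda_src/
    ├── lambda_function.py
    └── requirements.txt   # contains the line: sagemaker

# run from my-app/
sam build
sam deploy --guided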

Alternatively, you can create the zip manually with pip install sagemaker -t lambda_folder, but you should run that command on an Amazon Linux OS, for example on an EC2 instance with the appropriate image or inside a Docker container, so the installed packages match the Lambda runtime (see the sketch below). Search for "python dependencies in aws lambda" for more info.
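For example (the container image and Python version here are illustrative, not prescribed by the answer), the dependency could be installed inside an AWS-provided Amazon Linux build image and then zipped:

# run from the folder that contains lambda_function.py
docker run --rm -v "$PWD":/var/task public.ecr.aws/sam/build-python3.9 \
    /bin/sh -c "pip install sagemaker -t /var/task"

# then zip the folder contents and upload the archive to S3 / Lambda
zip -r lambda_package.zip .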
