Azure ML PipelineData with DataTransferStep results in a 0-byte file

This article covers a case where an Azure ML PipelineData output passed to a DataTransferStep ends up as a 0-byte file; the discussion below may be a useful reference if you hit the same problem.

Problem description

I am building an Azure ML pipeline with the azureml Python SDK. The pipeline calls a PythonScriptStep which stores data on the workspaceblobstore of the AML workspace.

I would like to extend the pipeline to export the pipeline data to an Azure Data Lake (Gen 1). As far as I understand, connecting the output of the PythonScriptStep directly to Azure Data Lake (Gen 1) is not supported by Azure ML, so I added an extra DataTransferStep to the pipeline, which takes the output from the PythonScriptStep directly as its input. According to the Microsoft documentation this should be possible.

So far I have built this solution, but it only results in a file of 0 bytes on the Gen 1 Data Lake. I think the output_export_blob PipelineData does not correctly reference test.csv, and therefore the DataTransferStep cannot find the input. How can I connect the DataTransferStep correctly to the PipelineData output from the PythonScriptStep?

Example I followed: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-with-data-dependency-steps.ipynb

pipeline.py

    import os

    from azureml.data.data_reference import DataReference
    from azureml.pipeline.core import Pipeline, PipelineData
    from azureml.pipeline.steps import DataTransferStep, PythonScriptStep

    # Datastores, compute targets, path constants and the delimited_dataset
    # helper are assumed to be defined elsewhere in the project.
    input_dataset = delimited_dataset(
        datastore=prdadls_datastore,
        folderpath=FOLDER_PATH_INPUT,
        filepath=INPUT_PATH
    )
    
    output_export_blob = PipelineData(
        'export_blob',
        datastore=workspaceblobstore_datastore,
    )
    
    test_step = PythonScriptStep(
        script_name="test_upload_stackoverflow.py",
        arguments=[
            "--output_extract", output_export_blob,
        ],
        inputs=[
            input_dataset.as_named_input('input'),
        ],
        outputs=[output_export_blob],
        compute_target=aml_compute,
        source_directory="."
    )
    
    output_export_adls = DataReference(
        datastore=prdadls_datastore, 
        path_on_datastore=os.path.join(FOLDER_PATH_OUTPUT, 'test.csv'),
        data_reference_name='export_adls'        
    )
    
    export_to_adls = DataTransferStep(
        name='export_output_to_adls',
        source_data_reference=output_export_blob,
        source_reference_type='file',
        destination_data_reference=output_export_adls,
        compute_target=adf_compute
    )
    
    pipeline = Pipeline(
        workspace=aml_workspace, 
        steps=[
            test_step, 
            export_to_adls
        ]
    )
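
For reference, a minimal sketch of how this pipeline could be submitted, continuing from the variables defined above; the experiment name is just an illustrative placeholder:

    from azureml.core import Experiment

    # Submit the pipeline as an experiment run and stream logs until it finishes.
    # "export-to-adls-test" is a placeholder experiment name.
    pipeline_run = Experiment(aml_workspace, "export-to-adls-test").submit(pipeline)
    pipeline_run.wait_for_completion(show_output=True)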
    

test_upload_stackoverflow.py

    import argparse
    import os
    import pathlib

    from azureml.core import Datastore, Run
    
    parser = argparse.ArgumentParser("train")
    parser.add_argument("--output_extract", type=str)
    args = parser.parse_args() 
    
    run = Run.get_context()
    df_data_all = (
        run
        .input_datasets["input"]
        .to_pandas_dataframe()
    )
    
    # The PipelineData path is treated as a directory here: create it and
    # write test.csv inside it.
    os.makedirs(args.output_extract, exist_ok=True)
    df_data_all.to_csv(
        os.path.join(args.output_extract, "test.csv"),
        index=False
    )
    

Solution

The code example is immensely helpful, thanks for that. You're right that it can be confusing to get PythonScriptStep -> PipelineData working, initially even without the DataTransferStep in the picture.

I don't know 100% what's going on, but I thought I'd spitball some ideas:

1. Does your PipelineData, export_blob, actually contain the "test.csv" file? I would verify that before troubleshooting the DataTransferStep. You can verify this using the SDK (see the sketch after this list) or, more easily, with the UI:

   1. Go to the PipelineRun page and click on the PythonScriptStep in question.
   2. On the "Outputs + Logs" page there's a "Data Outputs" section (it is slow to load initially).
   3. Open it, you'll see the output PipelineData objects, then click "View Output".
   4. Navigate to the given path in either the Azure Portal or Azure Storage Explorer.

2. In test_upload_stackoverflow.py you are treating the PipelineData as a directory when you call .to_csv(), as opposed to a file, which would mean just calling df_data_all.to_csv(args.output_extract, index=False). Perhaps try defining the PipelineData with is_directory=True (see the second sketch below). Not sure if this is required though.
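
For the first check, a rough sketch of the SDK route, assuming SDK v1; the experiment name, run id and step name below are placeholders, so adjust them to whatever your run actually shows:

    from azureml.core import Experiment, Workspace
    from azureml.pipeline.core.run import PipelineRun

    ws = Workspace.from_config()

    # Placeholder experiment name and run id of the pipeline run to inspect.
    pipeline_run = PipelineRun(Experiment(ws, "<experiment-name>"), run_id="<pipeline-run-id>")

    # Locate the PythonScriptStep run; the step name is assumed to default to the
    # script name here, check the run UI for the actual name.
    step_run = pipeline_run.find_step_run("test_upload_stackoverflow.py")[0]

    # Download the 'export_blob' PipelineData output locally and check that
    # test.csv is there and is not 0 bytes.
    port_ref = step_run.get_output("export_blob").get_port_data_reference()
    port_ref.download(local_path="./export_blob_check")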
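
And for the second idea, a sketch of the two ways the pieces could be kept consistent; this reuses the names from pipeline.py above, and whether is_directory is actually needed is an assumption on my part, not something I have verified:

    from azureml.pipeline.core import PipelineData
    from azureml.pipeline.steps import DataTransferStep

    # Option A: treat the output as a single file end to end. In
    # test_upload_stackoverflow.py write straight to the argument path,
    #     df_data_all.to_csv(args.output_extract, index=False)
    # and keep source_reference_type='file' in the DataTransferStep.

    # Option B: treat the output as a directory end to end.
    output_export_blob = PipelineData(
        'export_blob',
        datastore=workspaceblobstore_datastore,
        is_directory=True,                       # declare the output as a folder
    )

    export_to_adls = DataTransferStep(
        name='export_output_to_adls',
        source_data_reference=output_export_blob,
        source_reference_type='directory',       # transfer the whole folder, test.csv included
        destination_data_reference=output_export_adls,
        destination_reference_type='directory',  # the ADLS DataReference should then point at a folder, not at test.csv
        compute_target=adf_compute
    )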

