Azure ML Studio ML Pipeline - Exception: No temp file found

Question

I've successfully run an ML Pipeline experiment and published the Azure ML Pipeline without issues. When I run the following directly after the successful run and publish (i.e. I'm running all cells using Jupyter), the test fails!

import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Authenticate interactively and grab a bearer token for the REST call
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

# Trigger the published pipeline via its REST endpoint
rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 6}})
run_id = response.json()["Id"]
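
As a side note, here is a minimal sketch (my addition, not from the original post) of how the queued run can be monitored from the same notebook, assuming ws is the Workspace object used to publish the pipeline:

from azureml.core import Experiment
from azureml.pipeline.core import PipelineRun

# The experiment name must match the one sent in the POST body above
experiment = Experiment(ws, "***redacted***")
pipeline_run = PipelineRun(experiment, run_id)
pipeline_run.wait_for_completion(show_output=True)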

Here is the error in azureml-logs/70_driver_log.txt:

[2020-12-10T17:17:50.124303] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
3 items cleaning up...
Cleanup took 0.20258069038391113 seconds
Traceback (most recent call last):
  File "driver/amlbi_main.py", line 48, in <module>
    main()
  File "driver/amlbi_main.py", line 44, in main
    JobStarter().start_job()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job_starter.py", line 52, in start_job
    job.start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job.py", line 105, in start
    master.wait()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/master.py", line 301, in wait
    file_helper.start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 206, in start
    self.analyze_source()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 69, in analyze_source
    raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.

Here are the errors in logs/sys/warning.txt:

requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://eastus.experiments.azureml.net/execution/v1.0/subscriptions/***redacted***/resourceGroups/***redacted***/providers/Microsoft.MachineLearningServices/workspaces/***redacted***/experiments/***redacted-experiment-name***/runs/***redacted-run-id***/telemetry

[...]

requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url:

...with the same URL.

Next steps...

When I wait a few minutes and rerun the following code/cell:

# Same call as above, but with process_count_per_node reduced to 2
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 2}})
run_id = response.json()["Id"]

It completes successfully!? Huh? (I changed the process count here, but I don't think that makes a difference.) Also, there are no user errors in the logs.

Any ideas as to what could be going on here?

Thanks in advance for any insights you might have, and happy coding! :)

========== UPDATE #1 ==========

Running on 1 file with ~300k rows. Sometimes the job works and sometimes it doesn't. We've tried many versions with different config settings; all of them fail from time to time. We changed the sklearn models to use n_jobs=1. We're scoring text data for NLP work.

from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import ParallelRunConfig

default_ds = ws.get_default_datastore()

# output dataset, written to the default datastore under model/results
output_dir = OutputFileDatasetConfig(destination=(default_ds, 'model/results')).register_on_complete(name='model_inferences')

# location of the scoring script
experiment_folder = 'model_pipeline'

# allow each invocation of the entry script up to 24 hours
rit = 60*60*24

parallel_run_config = ParallelRunConfig(
    source_directory=experiment_folder,
    entry_script="score.py",
    mini_batch_size="5",  # for a FileDataset input, this is files per mini-batch
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=5,
    run_invocation_timeout=rit,
    process_count_per_node=1
)
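
For reference, here is a minimal sketch of what the score.py entry script might look like, assuming a FileDataset input (so mini_batch arrives as a list of file paths), a registered sklearn model named "my_nlp_model", and a "text" column; all three names are placeholders, not the actual script:

import joblib
import pandas as pd
from azureml.core.model import Model

def init():
    global model
    # Runs once per worker process; load the model here, not in run().
    model = joblib.load(Model.get_model_path("my_nlp_model"))
    # Mirror the n_jobs=1 change: avoid nested parallelism inside each worker.
    if hasattr(model, "n_jobs"):
        model.n_jobs = 1

def run(mini_batch):
    # With output_action="append_row", every string returned here becomes one
    # row in the aggregated parallel_run_step.txt output file.
    results = []
    for file_path in mini_batch:
        df = pd.read_csv(file_path)
        preds = model.predict(df["text"])
        results.extend(str(p) for p in preds)
    return results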

Our next test was going to be to put each row of data into its own file. I tried this with just 30 rows, i.e. 30 files, each with 1 record for scoring, and I'm still getting the same error. This time I changed the error threshold to 1.

2020-12-17 02:26:16,721|ParallelRunStep.ProgressSummary|INFO|112|The ParallelRunStep processed all mini batches. There are 6 mini batches with 30 items. Processed 6 mini batches containing 30 items, 30 succeeded, 0 failed. The error threshold is 1. 
2020-12-17 02:26:16,722|ParallelRunStep.Telemetry|INFO|112|Start concatenating.
2020-12-17 02:26:17,202|ParallelRunStep.FileHelper|ERROR|112|No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
2020-12-17 02:26:17,368|ParallelRunStep.Telemetry|INFO|112|Run status: Running
2020-12-17 02:26:17,495|ParallelRunStep.Telemetry|ERROR|112|Exception occurred executing job: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause..
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/job.py", line 105, in start
    master.wait()
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/master.py", line 301, in wait
    file_helper.start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 206, in start
    self.analyze_source()
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 69, in analyze_source
    raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.

And on the rounds where it does complete, only some of the records are returned. One time the number of records returned was 25 or 23, I think, and another time it was 15.

========== UPDATE #2: 12/17/2020 ==========

I removed one of my models (my model is a weighted blend of 15 models). I even cleaned up my text fields, removing all tabs, newlines, and commas. Now I'm scoring 30 files, each with 1 record; the job sometimes completes, but it doesn't return 30 records. Other times it returns an error, and I'm still getting the "No temp file found" error.

Answer

I think I might have answered my own question. I think the issue was with OutputFileDatasetConfig. Once I switched back to using PipelineData, everything started working again. I guess Azure wasn't kidding when they say that OutputFileDatasetConfig is still experimental.
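
A minimal sketch of the swap, reusing parallel_run_config and default_ds from above; the step name, pipeline name, and input_dataset are placeholders of mine:

from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import ParallelRunStep

# PipelineData writes to an intermediate folder keyed by the child step run ID
output_dir = PipelineData(name="inferences", datastore=default_ds)

parallel_run_step = ParallelRunStep(
    name="batch-score",
    parallel_run_config=parallel_run_config,
    inputs=[input_dataset.as_named_input("input_ds")],
    output=output_dir,
    allow_reuse=False
)

pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])
published_pipeline = pipeline.publish(name="model-pipeline")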

The thing I still don't understand is how we're supposed to pick up the results of an ML Studio Pipeline from a Data Factory Pipeline without OutputFileDatasetConfig. PipelineData outputs the results in a folder based on the child step run ID, so how is Data Factory supposed to know where to get the results? I'd love to hear any feedback anyone might have. Thanks :)
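
For what it's worth, one workaround I can imagine (an assumption on my part, not necessarily what the linked answer below does) is a final PythonScriptStep that copies the aggregated output from the run-specific PipelineData folder to a fixed, well-known datastore path that Data Factory can poll; copy_results.py is a hypothetical helper script:

from azureml.pipeline.steps import PythonScriptStep

# copy_results.py (hypothetical) would read parallel_run_step.txt from the
# mounted input folder and upload it to a fixed path such as model/results/latest
copy_step = PythonScriptStep(
    name="copy-results-to-fixed-path",
    script_name="copy_results.py",
    source_directory=experiment_folder,
    arguments=["--input_dir", output_dir],
    inputs=[output_dir],
    compute_target=compute_target
)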

== UPDATE ==

For picking up the results of an ML Studio Pipeline from a Data Factory Pipeline, check out Pick up Results From ML Studio Pipeline in Data Factory Pipeline.

== UPDATE #2 ==

https://github.com/Azure/azure-sdk-for-python/issues/16568#issuecomment-781526789

Hi @yeamusic21, thank you for your feedback. In the current version, OutputDatasetConfig can't work with ParallelRunStep; we are working on fixing it.
