Azure ML Studio ML Pipeline - Exception: No temp file found

Question

I've successfully run an ML Pipeline experiment and published the Azure ML Pipeline without issues. When I run the following directly after the successful run and publish (i.e. I'm running all cells using Jupyter), the test fails!

import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Authenticate interactively and grab a bearer token for the REST call
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

# Trigger the published pipeline via its REST endpoint
rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 6}})
run_id = response.json()["Id"]
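
As a side note, here is a minimal sketch (my addition, not from the original post) of how the queued run can be monitored from the same notebook, assuming ws is the Workspace object used to publish the pipeline:

from azureml.core import Experiment
from azureml.pipeline.core import PipelineRun

# The experiment name must match the one sent in the POST body above
experiment = Experiment(ws, "***redacted***")
pipeline_run = PipelineRun(experiment, run_id)
pipeline_run.wait_for_completion(show_output=True)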

Here is the error in azureml-logs/70_driver_log.txt:

[2020-12-10T17:17:50.124303] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
3 items cleaning up...
Cleanup took 0.20258069038391113 seconds
Traceback (most recent call last):
  File "driver/amlbi_main.py", line 48, in <module>
    main()
  File "driver/amlbi_main.py", line 44, in main
    JobStarter().start_job()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job_starter.py", line 52, in start_job
    job.start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/job.py", line 105, in start
    master.wait()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/master.py", line 301, in wait
    file_helper.start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 206, in start
    self.analyze_source()
  File "/mnt/batch/tasks/shared/LS_root/jobs/***redacted***/azureml/***redacted***/mounts/workspaceblobstore/azureml/***redacted***/driver/file_helper.py", line 69, in analyze_source
    raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.

Here are the errors in logs/sys/warning.txt:

requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://eastus.experiments.azureml.net/execution/v1.0/subscriptions/***redacted***/resourceGroups/***redacted***/providers/Microsoft.MachineLearningServices/workspaces/***redacted***/experiments/***redacted-experiment-name***/runs/***redacted-run-id***/telemetry

[...]

requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url:

...with the same URL.

Next steps...

When I wait a few minutes and rerun the following code/cell:

# Same call as above, but with process_count_per_node reduced to 2
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "***redacted***",
                               "ParameterAssignments": {"process_count_per_node": 2}})
run_id = response.json()["Id"]

It completes successfully!? Huh? (I changed the process count here, but I don't think that makes a difference.) Also, there are no user errors in the logs.

Any ideas as to what could be going on here?

Thanks in advance for any insights you might have, and happy coding! :)

========== UPDATE #1 ==========

Running on 1 file with ~300k rows. Sometimes the job works and sometimes it doesn't. We've tried many versions with different config settings; all of them fail from time to time. We changed the sklearn models to use n_jobs=1. We're scoring text data for NLP work.

from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import ParallelRunConfig

default_ds = ws.get_default_datastore()

# output dataset, written to the default datastore under model/results
output_dir = OutputFileDatasetConfig(destination=(default_ds, 'model/results')).register_on_complete(name='model_inferences')

# location of the scoring script
experiment_folder = 'model_pipeline'

# allow each invocation of the entry script up to 24 hours
rit = 60*60*24

parallel_run_config = ParallelRunConfig(
    source_directory=experiment_folder,
    entry_script="score.py",
    mini_batch_size="5",  # for a FileDataset input, this is files per mini-batch
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=5,
    run_invocation_timeout=rit,
    process_count_per_node=1
)
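
For reference, here is a minimal sketch of what the score.py entry script might look like, assuming a FileDataset input (so mini_batch arrives as a list of file paths), a registered sklearn model named "my_nlp_model", and a "text" column; all three names are placeholders, not the actual script:

import joblib
import pandas as pd
from azureml.core.model import Model

def init():
    global model
    # Runs once per worker process; load the model here, not in run().
    model = joblib.load(Model.get_model_path("my_nlp_model"))
    # Mirror the n_jobs=1 change: avoid nested parallelism inside each worker.
    if hasattr(model, "n_jobs"):
        model.n_jobs = 1

def run(mini_batch):
    # With output_action="append_row", every string returned here becomes one
    # row in the aggregated parallel_run_step.txt output file.
    results = []
    for file_path in mini_batch:
        df = pd.read_csv(file_path)
        preds = model.predict(df["text"])
        results.extend(str(p) for p in preds)
    return results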

Our next test was going to be to put each row of data into its own file. I tried this with just 30 rows, i.e. 30 files, each with 1 record for scoring, and I'm still getting the same error. This time I changed the error threshold to 1.

2020-12-17 02:26:16,721|ParallelRunStep.ProgressSummary|INFO|112|The ParallelRunStep processed all mini batches. There are 6 mini batches with 30 items. Processed 6 mini batches containing 30 items, 30 succeeded, 0 failed. The error threshold is 1. 
2020-12-17 02:26:16,722|ParallelRunStep.Telemetry|INFO|112|Start concatenating.
2020-12-17 02:26:17,202|ParallelRunStep.FileHelper|ERROR|112|No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.
2020-12-17 02:26:17,368|ParallelRunStep.Telemetry|INFO|112|Run status: Running
2020-12-17 02:26:17,495|ParallelRunStep.Telemetry|ERROR|112|Exception occurred executing job: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause..
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/job.py", line 105, in start
    master.wait()
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/master.py", line 301, in wait
    file_helper.start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 206, in start
    self.analyze_source()
  File "/mnt/batch/tasks/shared/LS_root/jobs/**redacted**/mounts/workspaceblobstore/azureml/**redacted**/driver/file_helper.py", line 69, in analyze_source
    raise Exception(message)
Exception: No temp file found. The job failed. A job should generate temp files or should fail before this. Please check logs for the cause.

And on the rounds where it does complete, only some of the records are returned. One time the number of records returned was 25 or 23, I think, and another time it was 15.

========== UPDATE #2: 12/17/2020 ==========

I removed one of my models (my model is a weighted blend of 15 models). I even cleaned up my text fields, removing all tabs, newlines, and commas. Now I'm scoring 30 files, each with 1 record; the job sometimes completes, but it doesn't return 30 records. Other times it returns an error, and I'm still getting the "No temp file found" error.

Answer

I think I might have answered my own question. I think the issue was with OutputFileDatasetConfig. Once I switched back to using PipelineData, everything started working again. I guess Azure wasn't kidding when they say that OutputFileDatasetConfig is still experimental.
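
A minimal sketch of the swap, reusing parallel_run_config and default_ds from above; the step name, pipeline name, and input_dataset are placeholders of mine:

from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import ParallelRunStep

# PipelineData writes to an intermediate folder keyed by the child step run ID
output_dir = PipelineData(name="inferences", datastore=default_ds)

parallel_run_step = ParallelRunStep(
    name="batch-score",
    parallel_run_config=parallel_run_config,
    inputs=[input_dataset.as_named_input("input_ds")],
    output=output_dir,
    allow_reuse=False
)

pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])
published_pipeline = pipeline.publish(name="model-pipeline")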

The thing I still don't understand is how we're supposed to pick up the results of an ML Studio Pipeline from a Data Factory Pipeline without OutputFileDatasetConfig. PipelineData outputs the results in a folder based on the child step run ID, so how is Data Factory supposed to know where to get the results? I'd love to hear any feedback anyone might have. Thanks :)
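
For what it's worth, one workaround I can imagine (an assumption on my part, not necessarily what the linked answer below does) is a final PythonScriptStep that copies the aggregated output from the run-specific PipelineData folder to a fixed, well-known datastore path that Data Factory can poll; copy_results.py is a hypothetical helper script:

from azureml.pipeline.steps import PythonScriptStep

# copy_results.py (hypothetical) would read parallel_run_step.txt from the
# mounted input folder and upload it to a fixed path such as model/results/latest
copy_step = PythonScriptStep(
    name="copy-results-to-fixed-path",
    script_name="copy_results.py",
    source_directory=experiment_folder,
    arguments=["--input_dir", output_dir],
    inputs=[output_dir],
    compute_target=compute_target
)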

== UPDATE ==

For picking up the results of an ML Studio Pipeline from a Data Factory Pipeline, check out Pick up Results From ML Studio Pipeline in Data Factory Pipeline.

== UPDATE #2 ==

https://github.com/Azure/azure-sdk-for-python/issues/16568#issuecomment-781526789

Hi @yeamusic21, thank you for your feedback. In the current version, OutputDatasetConfig can't work with ParallelRunStep; we are working on fixing it.
