Error while cleaning up resources after training in an HDInsight Spark cluster

Problem description

I'm training a sample experiment in AMLS. My goal is to run it on a remote HDI cluster with Python 3.

I manually installed all the libraries needed on the cluster (on both the head nodes and the worker nodes).

By checking the log files created for each run, I can see that it completed successfully, but for some reason the run ended with a failed status.

After the script runs successfully (I see the print messages that I put at the end of it), the cluster starts cleaning up all the resources, and it seems to look for an "outputs" folder under the "/user/sshuser" path. This is the error message:

The experiment completed successfully. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.5020554065704346 seconds
Traceback (most recent call last):
  File "context_manager_injector.py", line 160, in <module>
    execute_with_context(cm_objects, options.invocation)
  File "context_manager_injector.py", line 113, in execute_with_context
    log_finalizing()
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/contextlib.py", line 380, in __exit__
    raise exc_details[1]
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/contextlib.py", line 365, in __exit__
    if cb(*exc_details):
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/contextlib.py", line 284, in _exit_wrapper
    return cm_exit(cm, *exc_details)
  File "context_manager_injector.py", line 43, in __exit__
    self.context_manager.__exit__(*exc_details)
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/context_managers.py", line 67, in __exit__
    self.history_context.__exit__(*args)
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 54, in __exit__
    return self._exit_stack.__exit__(*args)
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/contextlib.py", line 380, in __exit__
    raise exc_details[1]
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/contextlib.py", line 365, in __exit__
    if cb(*exc_details):
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/contextlib.py", line 284, in _exit_wrapper
    return cm_exit(cm, *exc_details)
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 140, in __exit__
    self.py_wd.track(self.run_tracker, self.trackfolders, self.deny_list)
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 93, in track
    fs.track(run_tracker, track_folders, blacklist)
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/site-packages/azureml/_history/utils/filesystem.py", line 138, in track
    self._upload_hdi_outputs(run_tracker)
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/site-packages/azureml/_history/utils/filesystem.py", line 177, in _upload_hdi_outputs
    upload_from_hdfs(run_tracker, track_prefix)
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/azureml_c876362074e4ef12eac17653e81a4b83.zip/home/sshuser/.azureml/envs/azureml_c876362074e4ef12eac17653e81a4b83/lib/python3.6/site-packages/azureml/_history/utils/_hdi_utils.py", line 63, in upload_from_hdfs
    aa = file_system.listFiles(path, True)
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1557328483739_0005/container_1557328483739_0005_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o170.listFiles.
: java.io.FileNotFoundException: File/Folder does not exist: /clusters/hdipi36/user/sshuser/<experiment_name>1557330292_20ee084d/outputs [c963e3ed-e386-42dc-b771-61db38b2092d][2019-05-08T08:47:03.6900690-07:00] [ServerRequestId:c963e3ed-e386-42dc-b771-61db38b2092d]
at sun.reflec
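
To confirm which folder the finalization step expected, the missing path can be pulled out of the exception text. The following is only a minimal sketch: it assumes the message keeps the "File/Folder does not exist: <path> [...]" shape shown in the traceback, and `extract_missing_path` is a hypothetical helper written for illustration, not part of azureml or py4j.

```python
import re
from typing import Optional

# Assumption: the message has the shape seen in the traceback,
# "File/Folder does not exist: <path> [<request id>]...".
_MISSING_PATH = re.compile(r"File/Folder does not exist:\s*(\S+)")


def extract_missing_path(error_text: str) -> Optional[str]:
    """Return the path named in a FileNotFoundException message, or None."""
    match = _MISSING_PATH.search(error_text)
    return match.group(1) if match else None


# Sample text copied from the error message in the question.
sample = (
    "py4j.protocol.Py4JJavaError: An error occurred while calling o170.listFiles.\n"
    ": java.io.FileNotFoundException: File/Folder does not exist: "
    "/clusters/hdipi36/user/sshuser/<experiment_name>1557330292_20ee084d/outputs "
    "[c963e3ed-e386-42dc-b771-61db38b2092d]"
)

missing = extract_missing_path(sample)
print(missing)
```

The extracted path could then be checked on the cluster, for example with `hdfs dfs -ls`, to see whether the run ever created an outputs folder there.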

Recommended answer

Hi Rodrigo,

Could you please try deleting the cluster from the console and re-running the experiment, to check whether subsequent runs can delete the new cluster correctly?

-Rohit

