Spark submit failing in yarn cluster mode when specifying --files in an Azure HDInsight cluster
Problem description
Spark submit in yarn cluster mode is failing, but it succeeds in client mode.
Spark submit:
spark-submit \
--master yarn --deploy-mode cluster \
--py-files packages.zip,deps2.zip \
--files /home/sshsanjeev/git/pyspark-example-demo/configs/etl_config.json \
jobs/etl_job.py
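In yarn cluster mode the driver runs inside a YARN container, so the absolute path used on the submitting machine no longer exists there; files shipped with --files land in the container's working directory under their basename. A minimal sketch of a loader that works in both modes (the helper name load_config is hypothetical, not from the original job):

```python
import json
import os


def load_config(path):
    """Load a JSON config shipped with --files.

    In client mode the submit-time path exists on the driver machine;
    in cluster mode only the basename is present in the container's
    working directory, so fall back to that.
    """
    candidates = [path, os.path.basename(path)]
    for candidate in candidates:
        if os.path.isfile(candidate):
            with open(candidate) as f:
                return json.load(f)
    raise FileNotFoundError(
        "config not found at any of: %s" % ", ".join(candidates))
```

With this pattern, calling load_config('/home/sshsanjeev/.../etl_config.json') still succeeds in cluster mode, because the --files mechanism has already copied etl_config.json next to the driver.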
Error stack:
Traceback (most recent call last):
File "etl_job.py", line 51, in <module>
main()
File "etl_job.py", line 11, in main
app_name='my_etl_job',spark_config={'spark.sql.shuffle.partitions':2})
File "/mnt/resource/hadoop/yarn/local/usercache/sshsanjeev/appcache/application_1555349704365_0218/container_1555349704365_0218_01_000001/packages.zip/dependencies/spark_conn.py", line 20, in start_spark
File "/usr/hdp/current/spark2-client/python/pyspark/context.py", line 891, in addFile
self._jsc.sc().addFile(path, recursive)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o204.addFile.
: java.io.FileNotFoundException: File file:/mnt/resource/hadoop/yarn/local/usercache/sshsanjeev/appcache/application_1555349704365_0218/container_1555349704365_0218_01_000001/configs/etl_config.json does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:624)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:850)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:614)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:422)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1529)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I did several online searches and followed this article https://community.cloudera.com/t5/Support-Questions/Spark-job-fails-in-cluster-mode/td-p/58772 but the issue is still not resolved.
Please note that I have tried two approaches, placing the config file in the local path on the Namenode as well as in an HDFS directory, but I still get the same error. In client mode the job runs successfully. I need guidance.
Here is the stack version of my HDP cluster:
HDP-2.6.5.3008 YARN 2.7.3 Spark2 2.3.2
Let me know if further info is required. Any suggestions would be highly appreciated.
Recommended answer
It could be related to a permission issue that prevents the directory from being created. If the directory cannot be created, there is no placeholder in which to stage the intermediate results, and the job fails. The directory referred to, /mnt/resource/hadoop/yarn/local/usercache/<username>/appcache/<applicationID>, is used to store intermediate results, which then go to HDFS or memory depending on whether they are written to a path or stored in temporary tables; once the job finishes, the directory is flushed out. The submitting user might not have permission on it. Granting the user correct permissions on the path /mnt/resource/hadoop/yarn/local/usercache on the specific worker node should resolve the issue.
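A hedged sketch of how that check might look on a worker node; the function name check_usercache is my own, and the path, user, and group below are assumptions taken from the error above, so adjust them for your cluster:

```shell
# Verify that the YARN usercache directory exists and is writable
# by the submitting user.
check_usercache() {
    dir="$1"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "usercache OK: $dir"
    else
        echo "usercache missing or not writable: $dir"
    fi
}

# Run on each worker node as the submitting user:
check_usercache /mnt/resource/hadoop/yarn/local/usercache/sshsanjeev

# If it reports a problem, restoring ownership (as root) is the usual fix;
# the hadoop group is an assumption and may differ on your nodes:
#   chown -R sshsanjeev:hadoop /mnt/resource/hadoop/yarn/local/usercache/sshsanjeev
#   chmod -R 750 /mnt/resource/hadoop/yarn/local/usercache/sshsanjeev
```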