Spark runs in local but can't find file when running in YARN


I've been trying to submit a simple Python script to run in a cluster with YARN. When I execute the job locally there's no problem, everything works fine, but when I run it in the cluster it fails.

I submitted the job with the following command:

spark-submit --master yarn --deploy-mode cluster test.py

The error log I'm receiving is the following:

17/11/07 13:02:48 INFO yarn.Client: Application report for application_1510046813642_0010 (state: ACCEPTED)
17/11/07 13:02:49 INFO yarn.Client: Application report for application_1510046813642_0010 (state: ACCEPTED)
17/11/07 13:02:50 INFO yarn.Client: Application report for application_1510046813642_0010 (state: FAILED)
17/11/07 13:02:50 INFO yarn.Client: 
     client token: N/A
     diagnostics: Application application_1510046813642_0010 failed 2 times due to AM Container for appattempt_1510046813642_0010_000002 exited with  exitCode: -1000
For more detailed output, check application tracking page:http://myserver:8088/proxy/application_1510046813642_0010/Then, click on links to logs of each attempt.
**Diagnostics: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py**
java.io.FileNotFoundException: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py
    at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
    at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Failing this attempt. Failing the application.
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: root.users.josholsan
     start time: 1510056155796
     final status: FAILED
     tracking URL: http://myserver:8088/cluster/app/application_1510046813642_0010
     user: josholsan
Exception in thread "main" org.apache.spark.SparkException: Application application_1510046813642_0010 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1025)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1072)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/11/07 13:02:50 INFO util.ShutdownHookManager: Shutdown hook called
17/11/07 13:02:50 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-5cc8bf5e-216b-4d9e-b66d-9dc01a94e851

I paid special attention to this line:

Diagnostics: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py

I don't know why it can't find test.py. I also tried putting it in HDFS under the home directory of the user executing the job: /user/josholsan/
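For reference, the HDFS upload attempt described above can be sketched with standard HDFS shell commands (the paths follow the error message; adjust them to your cluster):

```shell
# Copy the driver script into the submitting user's HDFS home directory
hdfs dfs -mkdir -p /user/josholsan
hdfs dfs -put -f test.py /user/josholsan/test.py

# Verify it actually landed there
hdfs dfs -ls /user/josholsan/test.py
```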

To finish my post, I would also like to share my test.py script:

from pyspark import SparkContext

file="/user/josholsan/concepts_copy.csv"
sc = SparkContext("local","Test app")
textFile = sc.textFile(file).cache()

linesWithOMOP=textFile.filter(lambda line: "OMOP" in line).count()
linesWithICD=textFile.filter(lambda line: "ICD" in line).count()

print("Lines with OMOP: %i, lines with ICD9: %i" % (linesWithOMOP,linesWithICD))

Could the error also be here?

sc = SparkContext("local","Test app")

Thank you so much in advance for your help.

Solution

Transferred from the comments section:

  • sc = SparkContext("local","Test app"): having "local" here will override any command line settings; from the docs:

    Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.

  • The test.py file must be placed somewhere where it is visible throughout the whole cluster. E.g. spark-submit --master yarn --deploy-mode cluster http://somewhere/accessible/to/master/and/workers/test.py
  • Any additional files and resources can be specified using the --py-files argument (tested in mesos, not in yarn unfortunately), e.g. --py-files http://somewhere/accessible/to/all/extra_python_code_my_code_uses.zip

    Edit: as @desertnaut commented, this argument should be used before the script to be executed.

  • yarn logs -applicationId <app ID> will give you the output of your submitted job.
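Putting the first bullet into practice, the script could be rewritten so the master comes from spark-submit instead of being hard-coded. This is a minimal sketch; the HDFS path is taken from the original script and assumed to be readable by all executors:

```python
from pyspark import SparkContext

# No "local" here: the master (e.g. yarn) is now supplied by spark-submit
sc = SparkContext(appName="Test app")

# Path assumed from the question; must be visible to the whole cluster
file = "hdfs:///user/josholsan/concepts_copy.csv"
textFile = sc.textFile(file).cache()

linesWithOMOP = textFile.filter(lambda line: "OMOP" in line).count()
linesWithICD = textFile.filter(lambda line: "ICD" in line).count()

print("Lines with OMOP: %i, lines with ICD9: %i" % (linesWithOMOP, linesWithICD))
```

Submitted with `spark-submit --master yarn --deploy-mode cluster test.py`, the YARN client then stages test.py itself, so the script only needs to exist locally on the submitting machine.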

Hope this helps, good luck!
