Read files sent with spark-submit by the driver


Question

I am sending a Spark job to run on a remote cluster by running

spark-submit ... --deploy-mode cluster --files some.properties ...

I want to read the content of the some.properties file by the driver code, i.e. before creating the Spark context and launching RDD tasks. The file is copied to the remote driver, but not to the driver's working directory.

The ways around this problem that I know of are:

  1. Upload the file to HDFS
  2. Store the file in the app jar

Both are inconvenient since this file is frequently changed on the submitting dev machine.

Is there a way to read the file that was uploaded using the --files flag during the driver code main method?

Answer

Yes, you can access files uploaded via the --files argument.

This is how I'm able to access files passed in via --files:

./bin/spark-submit \
--class com.MyClass \
--master yarn-cluster \
--files /path/to/some/file.ext \
--jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
/path/to/app.jar file.ext

and in my Spark code:

import scala.io.Source

val filename = args(0)
val linecount = Source.fromFile(filename).getLines.size

I believe these files are downloaded onto the workers into the same directory where the jar is placed, which is why passing just the filename, rather than an absolute path, to Source.fromFile works.
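Applied to the some.properties file from the question, the same idea lets the driver parse the properties with java.util.Properties before the SparkContext is created, since the file lands in the driver's working directory in cluster mode. A minimal sketch (the loadProperties helper name is mine, not from the original answer):

```scala
import java.io.FileInputStream
import java.util.Properties

// Load a .properties file from the driver's working directory.
// Pass the bare filename given to --files, e.g. "some.properties".
def loadProperties(filename: String): Properties = {
  val props = new Properties()
  val in = new FileInputStream(filename)
  try props.load(in) finally in.close()
  props
}

// In main, before building the SparkContext:
//   val props = loadProperties("some.properties")
//   val someSetting = props.getProperty("some.key")
```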
