How to access external property file in spark-submit job?

Problem Description

I am using Spark 2.4.1 and Java 8. I am trying to load an external property file when submitting my Spark job with spark-submit.

I am using the Typesafe Config dependency below to load my property file.

<dependency>
    <groupId>com.typesafe</groupId>
    <artifactId>config</artifactId>
    <version>1.3.1</version>
</dependency>

In my code I am using:

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public static Config loadEnvProperties(String environment) {
    Config appConf = ConfigFactory.load();  // loads "application.properties" from the "resources" folder
    return appConf.getConfig(environment);
}
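For reference, appConf.getConfig(environment) looks up a top-level object named after the environment argument, so the properties file has to define keys under that prefix. A minimal, hypothetical usage sketch (the db.url key is illustrative, not from the original post):

// The properties file must define keys under the environment prefix,
// e.g.  sit.db.url=jdbc:...  -- otherwise getConfig("sit") throws
// com.typesafe.config.ConfigException$Missing.
Config sitConf = loadEnvProperties("sit");
String dbUrl = sitConf.getString("db.url");  // reads sit.db.url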

To externalize this "application.properties" file, I tried the following spark-submit, as suggested by an expert:

spark-submit \
--master yarn \
--deploy-mode cluster \
--name Extractor  \
--jars "/local/apps/jars/*.jar" \
--files /local/apps/log4j.properties \
--files /local/apps/applicationNew.properties \
--class Driver \
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.debug \
--conf spark.driver.extraClassPath=. \
  migration-0.0.1.jar sit 

I placed the "log4j.properties" and "applicationNew.properties" files in the same folder from which I run spark-submit.

1) In the above shell script, if I keep

--files /local/apps/log4j.properties,  /local/apps/applicationNew.properties \

I get this error:

Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/local/apps//applicationNew.properties
        at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)

So what is wrong here?

2) Then I changed the above script as shown, i.e.

--files /local/apps/log4j.properties \
--files /local/apps/applicationNew.properties \

and when I run the Spark job I get the following error:

19/08/02 14:19:09 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'
        at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:152)

So what is wrong here? Why isn't the applicationNew.properties file being loaded?

3) When I debugged it as below, i.e. printed "config.file":

String ss = System.getProperty("config.file");
logger.error("config.file : {}", ss);

the output is:

19/08/02 14:19:09 ERROR Driver: config.file : null
19/08/02 14:19:09 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'

So how do I set the "config.file" option from spark-submit?

How do I fix the above errors and load properties from the external applicationNew.properties file?

Recommended Answer

--files and SparkFiles.get

With --files you should access the resource using SparkFiles.get as follows:

$ ./bin/spark-shell --files README.md

scala> import org.apache.spark._
import org.apache.spark._

scala> SparkFiles.get("README.md")
res0: String = /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/spark-f0b16df1-fba6-4462-b956-fc14ee6c675a/userFiles-eef6d900-cd79-4364-a4a2-dd177b4841d2/README.md

In other words, Spark distributes the --files to the executors, but the only way to know the path of those files is to use the SparkFiles utility.
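Since the question is in Java, here is a minimal sketch of that approach combined with Typesafe Config, assuming the file was shipped via --files under the name applicationNew.properties (the file name comes from the question; the class wrapper is illustrative):

import java.io.File;

import org.apache.spark.SparkFiles;

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class ConfigLoader {

    // Resolve the absolute path Spark assigned to the distributed file, then parse it.
    public static Config loadEnvProperties(String environment) {
        String path = SparkFiles.get("applicationNew.properties");
        Config appConf = ConfigFactory.parseFile(new File(path)).resolve();
        return appConf.getConfig(environment);  // e.g. the "sit" section
    }
}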

The other option would be to package all resource files into a jar file and bundle it together with the other jar files (either as a single uber-jar or simply as part of CLASSPATH of the Spark app) and use the following trick:

this.getClass.getClassLoader.getResourceAsStream(resourceFile)

With that, regardless of the jar file the resourceFile is in, as long as it's on the CLASSPATH, it should be available to the application.

I'm pretty sure any decent framework or library that uses resource files for configuration, e.g. Typesafe Config, accepts InputStream as the way to read resource files.
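With Typesafe Config specifically, ConfigFactory.parseReader (or ConfigFactory.parseResources) can consume such a stream. A minimal sketch of the classpath approach, assuming the file is bundled as a resource inside one of the jars (the class name is illustrative):

import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class ClasspathConfigLoader {

    // Read a config file bundled inside any jar (or directory) on the CLASSPATH.
    public static Config loadFromClasspath(String resourceFile) {
        InputStream in = ClasspathConfigLoader.class
                .getClassLoader()
                .getResourceAsStream(resourceFile);
        if (in == null) {
            throw new IllegalArgumentException(resourceFile + " not found on the CLASSPATH");
        }
        return ConfigFactory.parseReader(
                new InputStreamReader(in, StandardCharsets.UTF_8)).resolve();
    }
}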

You could also include the --files as part of a jar file that is part of the CLASSPATH of the executors, but that'd be obviously less flexible (as every time you'd like to submit your Spark app with a different file, you'd have to recreate the jar).
