Add jars to a Spark Job - spark-submit


Problem description

True ... it has been discussed quite a lot.

However there is a lot of ambiguity and some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.

The following ambiguous, unclear, and/or omitted details should be clarified for each option:

  • How the ClassPath is affected
    • for the Driver
    • for the Executor (for the tasks that run)
    • both
    • not at all
  • Whether the provided files are automatically distributed
    • for the tasks (to each executor)
    • for the remote driver (if running in cluster mode)
    1. --jars
    2. The SparkContext.addJar(...) method
    3. The SparkContext.addFile(...) method
    4. --conf spark.driver.extraClassPath=... or --driver-class-path ...
    5. --conf spark.driver.extraLibraryPath=... or --driver-library-path ...
    6. --conf spark.executor.extraClassPath=...
    7. --conf spark.executor.extraLibraryPath=...
    8. Not to forget, the last parameter of spark-submit is also a .jar file.

    I am aware of where I can find the main Spark documentation, specifically about how to submit, the options available, and also the JavaDoc. However, that still left me with quite a few holes, although it did answer things partially too.

    I hope that it is not all that complex, and that someone can give me a clear and concise answer.

    If I were to guess from the documentation, it seems that --jars, and the SparkContext addJar and addFile methods are the ones that will automatically distribute files, while the other options merely modify the ClassPath.

    Would it be safe to assume that for simplicity, I can add additional application jar files using the 3 main options at the same time:

    spark-submit --jars additional1.jar,additional2.jar \
      --driver-library-path additional1.jar:additional2.jar \
      --conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
      --class MyClass main-application.jar
    

    Found a nice article in an answer to another posting; however, nothing new was learned there. The poster does make a good remark on the difference between the local driver (yarn-client) and the remote driver (yarn-cluster). Definitely important to keep in mind.

    Recommended answer

    ClassPath:

    ClassPath is affected depending on what you provide. There are a couple of ways to set something on the classpath:

    • spark.driver.extraClassPath, or its alias --driver-class-path, to set an extra classpath on the node running the driver.
    • spark.executor.extraClassPath to set an extra classpath on the Worker nodes.

    If you want a certain JAR to take effect on both the Master and the Worker, you have to specify it separately in BOTH flags.

    The separation character follows the same rules as the JVM (see the combined sketch after this list):

    • Linux: a colon, :
      • e.g.: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar"
    • Windows: a semicolon, ;
      • e.g.: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar;/opt/prog/aws-java-sdk-1.10.50.jar"
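    Putting the two pieces together, a sketch (on Linux, reusing the paths and class/JAR names from the examples above) of declaring the same JARs once for the driver and once for the executors would look roughly like this:

      # Sketch only: extraClassPath does not copy anything, so these JARs must already
      # exist at the same local paths on the driver node and on every worker node.
      spark-submit \
        --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar" \
        --conf "spark.executor.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar" \
        --class MyClass main-application.jar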

      Distribution of files:

      This depends on the mode under which you're running your job:

      1. Client mode - Spark fires up a Netty HTTP server which distributes the files on startup for each of the worker nodes. You can see that when you start your Spark job:

      16/05/08 17:29:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-48911afa-db63-4ffc-a298-015e8b96bc55/httpd-84ae312b-5863-4f4c-a1ea-537bfca2bc2b
      16/05/08 17:29:12 INFO HttpServer: Starting HTTP Server
      16/05/08 17:29:12 INFO Utils: Successfully started service 'HTTP file server' on port 58922.
      16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/foo.jar at http://***:58922/jars/com.mycode.jar with timestamp 1462728552732
      16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/aws-java-sdk-1.10.50.jar at http://***:58922/jars/aws-java-sdk-1.10.50.jar with timestamp 1462728552767
      

    • Cluster mode - In cluster mode, Spark selects a leader Worker node on which to execute the Driver process. This means the job isn't running directly from the Master node. Here, Spark will not set up an HTTP server. You have to manually make your JARs available to all the worker nodes via HDFS, S3, or other sources which are available to all nodes.
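      A minimal sketch of that workflow, reusing the JAR names from the question and a hypothetical /libs directory on HDFS:

        # Hypothetical HDFS location; any store reachable from every node (HDFS, S3, ...) works.
        hdfs dfs -put additional1.jar additional2.jar /libs/
        # In cluster mode the main application JAR itself may also need to live on shared storage.
        spark-submit --deploy-mode cluster \
          --jars hdfs:///libs/additional1.jar,hdfs:///libs/additional2.jar \
          --class MyClass main-application.jar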

      Accepted URIs for files

      In "Submitting Applications", the Spark documentation does a good job of explaining the accepted prefixes for files:

      When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars:

      • file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server.
      • hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected.
      • local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and it works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
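      To illustrate the difference between those schemes, here is a sketch with made-up paths (additional1.jar and additional2.jar come from the question; preinstalled.jar is a hypothetical library assumed to already exist at the same path on every node):

        # file:///...  - shipped from the submitting machine via the driver's file server
        # hdfs:///...  - pulled by each executor directly from HDFS
        # local:///... - NOT copied; expected to already exist at this path on every node
        spark-submit \
          --jars file:///opt/libs/additional1.jar,hdfs:///libs/additional2.jar,local:///opt/libs/preinstalled.jar \
          --class MyClass main-application.jar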

      Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.

      As noted, JARs are copied to the working directory for each Worker node. Where exactly is that? It is usually under /var/run/spark/work, where you'll see them like this:

      drwxr-xr-x    3 spark spark   4096 May 15 06:16 app-20160515061614-0027
      drwxr-xr-x    3 spark spark   4096 May 15 07:04 app-20160515070442-0028
      drwxr-xr-x    3 spark spark   4096 May 15 07:18 app-20160515071819-0029
      drwxr-xr-x    3 spark spark   4096 May 15 07:38 app-20160515073852-0030
      drwxr-xr-x    3 spark spark   4096 May 15 08:13 app-20160515081350-0031
      drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172020-0032
      drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172045-0033
      

      And when you look inside, you'll see all the JARs you deployed along:

      [*@*]$ cd /var/run/spark/work/app-20160508173423-0014/1/
      [*@*]$ ll
      total 89988
      -rwxr-xr-x 1 spark spark   801117 May  8 17:34 awscala_2.10-0.5.5.jar
      -rwxr-xr-x 1 spark spark 29558264 May  8 17:34 aws-java-sdk-1.10.50.jar
      -rwxr-xr-x 1 spark spark 59466931 May  8 17:34 com.mycode.code.jar
      -rwxr-xr-x 1 spark spark  2308517 May  8 17:34 guava-19.0.jar
      -rw-r--r-- 1 spark spark      457 May  8 17:34 stderr
      -rw-r--r-- 1 spark spark        0 May  8 17:34 stdout
      

      Affected options:

      The most important thing to understand is priority. If you pass any property via code, it will take precedence over any option you specify via spark-submit. This is mentioned in the Spark documentation:

      Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.

      So make sure you set those values in the proper places, so you won't be surprised when one takes priority over the other.
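      For example (a sketch with made-up JAR paths), the same property set in three places resolves in the documented order - code first, then the flag, then the defaults file:

        # spark-defaults.conf (lowest precedence of the three):
        #   spark.executor.extraClassPath  /opt/defaults/lib-v1.jar

        # A flag passed to spark-submit overrides the defaults file:
        spark-submit --conf spark.executor.extraClassPath=/opt/override/lib-v2.jar \
          --class MyClass main-application.jar

        # Anything set directly on SparkConf inside the application code, e.g.
        #   new SparkConf().set("spark.executor.extraClassPath", "/opt/code/lib-v3.jar")
        # takes precedence over both of the above.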

      Let's analyze every option in question:

      • --jars vs SparkContext.addJar: These are identical; only one is set through spark-submit and one via code. Choose the one which suits you better. One important thing to note is that using either of these options does not add the JAR to your driver/executor classpath; you'll need to explicitly add them using the extraClassPath config on both.
      • SparkContext.addJar vs SparkContext.addFile: Use the former when you have a dependency that needs to be used with your code. Use the latter when you simply want to pass an arbitrary file around to your worker nodes, which isn't a run-time dependency in your code (see the sketch after this list).
      • --conf spark.driver.extraClassPath=... or --driver-class-path: These are aliases; it doesn't matter which one you choose.
      • --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...: Same as above, aliases.
      • --conf spark.executor.extraClassPath=...: Use this when you have a dependency which can't be included in an uber JAR (for example, because there are compile-time conflicts between library versions) and which you need to load at runtime.
      • --conf spark.executor.extraLibraryPath=...: This is passed as the java.library.path option for the JVM. Use this when you need a library path visible to the JVM.
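      To make the first two bullets concrete, a sketch using the spark-submit counterparts of those methods (--files mirrors SparkContext.addFile; lookup-table.csv is a hypothetical data file, readable on the executors via SparkFiles.get):

        # --jars  ~ SparkContext.addJar  : code dependencies (still need extraClassPath, see below)
        # --files ~ SparkContext.addFile : arbitrary files copied to each executor's working directory;
        #                                  read them back with SparkFiles.get("lookup-table.csv")
        spark-submit \
          --jars additional1.jar,additional2.jar \
          --files lookup-table.csv \
          --class MyClass main-application.jar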

      Would it be safe to assume that for simplicity, I can add additional application jar files using the 3 main options at the same time:

      You can safely assume this only for client mode, not cluster mode, as I've previously said. Also, the example you gave has some redundant arguments. For example, passing JARs to --driver-library-path is useless; you need to pass them to extraClassPath if you want them to be on your classpath. Ultimately, what you want to do when you deploy external JARs on both the driver and the workers is:

      spark-submit --jars additional1.jar,additional2.jar \
        --driver-class-path additional1.jar:additional2.jar \
        --conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
        --class MyClass main-application.jar
      
