Add jars to a Spark Job - spark-submit


Problem description

True ... it has been discussed quite a lot.

However there is a lot of ambiguity and some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.

The following ambiguous, unclear, and/or omitted details should be clarified for each option:


  • How the ClassPath is affected
    • Driver
    • Executor (for tasks running)
    • Both
    • not at all
  • Whether the provided files are automatically distributed
    • to the tasks (to each executor)
    • to the remote Driver (if run in cluster mode)


    1. --jars
    2. SparkContext.addJar(...) method
    3. SparkContext.addFile(...) method
    4. --conf spark.driver.extraClassPath=... or --driver-class-path ...
    5. --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...
    6. --conf spark.executor.extraClassPath=...
    7. --conf spark.executor.extraLibraryPath=...
    8. not to forget, the last parameter of spark-submit is also a .jar file.

    I am aware of where to find the main Spark documentation, specifically about how to submit, the options available, and also the JavaDoc. However, that still left quite a few holes for me, although it answered parts of the question too.

    I hope that it is not all that complex, and that someone can give me a clear and concise answer.

    If I were to guess from documentation, it seems that --jars, and the SparkContext addJar and addFile methods are the ones that will automatically distribute files, while the other options merely modify the ClassPath.
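    For reference, here is a minimal sketch of the two SparkContext methods mentioned above, assuming a plain Scala application and hypothetical paths:

    import org.apache.spark.{SparkConf, SparkContext}

    object DistributionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("distribution-sketch"))

        // Ships the jar to the cluster and adds it to the classpath used by tasks.
        sc.addJar("/opt/libs/additional1.jar")   // hypothetical path

        // Ships an arbitrary file; it is distributed but not put on any classpath.
        sc.addFile("/opt/data/lookup.csv")       // hypothetical path

        sc.stop()
      }
    }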

    Would it be safe to assume that for simplicity, I can add additional application jar files using the 3 main options at the same time:

    spark-submit --jars additional1.jar,additional2.jar \
      --driver-library-path additional1.jar:additional2.jar \
      --conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
      --class MyClass main-application.jar
    

    Found a nice article in an answer to another posting. However, nothing new was learned. The poster does make a good remark about the difference between a local driver (yarn-client) and a remote driver (yarn-cluster). Definitely important to keep in mind.

    Recommended answer

    ClassPath:

    ClassPath is affected depending on what you provide. There are a couple of ways to set something on the classpath:


    • spark.driver.extraClassPath or its alias --driver-class-path to set extra classpaths on the Master node.
    • spark.executor.extraClassPath to set extra class path on the Worker nodes.

    If you want a certain JAR to take effect on both the Master and the Worker, you have to specify it separately in BOTH flags.
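    As a hedged illustration only (the paths and class name below are hypothetical), the programmatic SparkLauncher API exposes both settings as constants, which makes the "specify it twice" rule explicit:

    import org.apache.spark.launcher.SparkLauncher

    object LaunchWithBothClassPaths {
      def main(args: Array[String]): Unit = {
        // The same entries go on BOTH the driver and the executor classpaths.
        val extraCp = "/opt/libs/additional1.jar:/opt/libs/additional2.jar"

        val process = new SparkLauncher()
          .setAppResource("/opt/apps/main-application.jar")           // hypothetical
          .setMainClass("MyClass")
          .setConf(SparkLauncher.DRIVER_EXTRA_CLASSPATH, extraCp)     // driver side
          .setConf(SparkLauncher.EXECUTOR_EXTRA_CLASSPATH, extraCp)   // executor side
          .launch()

        process.waitFor()
      }
    }

    With plain spark-submit, the equivalent is simply passing both --conf spark.driver.extraClassPath=... and --conf spark.executor.extraClassPath=... flags.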

    The separator character follows the same rules as the JVM classpath:

    • Linux: a colon, :

      • e.g.: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar:/opt/prog/aws-java-sdk-1.10.50.jar"

    • Windows: a semicolon, ;

      • e.g.: --conf "spark.driver.extraClassPath=/opt/prog/hadoop-aws-2.7.1.jar;/opt/prog/aws-java-sdk-1.10.50.jar"
      File distribution:

      This depends on the mode under which you're running your job:


      • Client mode - Spark fires up a Netty HTTP server which distributes the files on startup to each of the worker nodes. You can see that when you start your Spark job:

      16/05/08 17:29:12 INFO HttpFileServer: HTTP File server directory is /tmp/spark-48911afa-db63-4ffc-a298-015e8b96bc55/httpd-84ae312b-5863-4f4c-a1ea-537bfca2bc2b
      16/05/08 17:29:12 INFO HttpServer: Starting HTTP Server
      16/05/08 17:29:12 INFO Utils: Successfully started service 'HTTP file server' on port 58922.
      16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/foo.jar at http://***:58922/jars/com.clicktale.ai.pageview-creator_0.0.3.0.jar with timestamp 1462728552732
      16/05/08 17:29:12 INFO SparkContext: Added JAR /opt/aws-java-sdk-1.10.50.jar at http://***:58922/jars/aws-java-sdk-1.10.50.jar with timestamp 1462728552767
      


    • Cluster mode - In cluster mode, Spark selects a leader Worker node to execute the Driver process on. This means the job isn't running directly from the Master node. Here, Spark will not set up an HTTP server. You have to manually make your JARs available to all the worker nodes via HDFS/S3/other sources which are accessible to all nodes.
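      As a hedged sketch of that point, a jar that already sits on shared storage can be referenced by URI from the application itself, so every node fetches it from the same place (the HDFS path is hypothetical):

      import org.apache.spark.{SparkConf, SparkContext}

      object ClusterModeJarSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("cluster-mode-jar-sketch"))

          // There is no driver-side HTTP file server in cluster mode, so point the
          // executors at a location every worker can reach (hypothetical HDFS path).
          sc.addJar("hdfs:///shared/libs/additional1.jar")

          sc.stop()
        }
      }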

      Accepted URIs for files

      In "Submitting Applications", the Spark documentation does a good job of explaining the accepted prefixes for files:

      When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL scheme to allow different strategies for disseminating jars:


          
      • file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server.

      • hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected.

      • local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.

      Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.

      As noted, JARs are copied to the working directory for each Worker node. Where exactly is that? It is usually under /var/run/spark/work, you'll see them like this:

      drwxr-xr-x    3 spark spark   4096 May 15 06:16 app-20160515061614-0027
      drwxr-xr-x    3 spark spark   4096 May 15 07:04 app-20160515070442-0028
      drwxr-xr-x    3 spark spark   4096 May 15 07:18 app-20160515071819-0029
      drwxr-xr-x    3 spark spark   4096 May 15 07:38 app-20160515073852-0030
      drwxr-xr-x    3 spark spark   4096 May 15 08:13 app-20160515081350-0031
      drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172020-0032
      drwxr-xr-x    3 spark spark   4096 May 18 17:20 app-20160518172045-0033
      

      And when you look inside, you'll see all the JARs you deployed along:

      [*@*]$ cd /var/run/spark/work/app-20160508173423-0014/1/
      [*@*]$ ll
      total 89988
      -rwxr-xr-x 1 spark spark   801117 May  8 17:34 awscala_2.10-0.5.5.jar
      -rwxr-xr-x 1 spark spark 29558264 May  8 17:34 aws-java-sdk-1.10.50.jar
      -rwxr-xr-x 1 spark spark 59466931 May  8 17:34 com.mycode.code.jar
      -rwxr-xr-x 1 spark spark  2308517 May  8 17:34 guava-19.0.jar
      -rw-r--r-- 1 spark spark      457 May  8 17:34 stderr
      -rw-r--r-- 1 spark spark        0 May  8 17:34 stdout
      

      Affected options:

      The most important thing to understand is priority. If you pass any property via code, it will take precedence over any option you specify via spark-submit. This is mentioned in the Spark documentation:

      Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file

      So make sure you set those values in the proper places, so you won't be surprised when one takes priority over the other.
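      For example, under the precedence rule quoted above, a property hard-coded on the SparkConf silently wins over the same key passed to spark-submit. A minimal sketch, with a hypothetical path:

      import org.apache.spark.{SparkConf, SparkContext}

      object PrecedenceSketch {
        def main(args: Array[String]): Unit = {
          // This value beats any `--conf spark.executor.extraClassPath=...` flag,
          // because properties set directly on SparkConf take highest precedence.
          val conf = new SparkConf()
            .setAppName("precedence-sketch")
            .set("spark.executor.extraClassPath", "/opt/libs/additional1.jar") // hypothetical

          val sc = new SparkContext(conf)
          // ... job code ...
          sc.stop()
        }
      }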

      Let's analyze each option in question:


      • --jars vs SparkContext.addJar: These are identical; only one is set through spark-submit and one via code. Choose the one which suits you better.
      • SparkContext.addJar vs SparkContext.addFile: Use the former when you have a dependency that needs to be used with your code. Use the latter when you simply want to pass an arbitrary file around to your worker nodes, which isn't a run-time dependency in your code (a retrieval sketch follows this list).
      • --conf spark.driver.extraClassPath=... or --driver-class-path: These are aliases; it doesn't matter which one you choose.
      • --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...: Same as above, aliases.
      • --conf spark.executor.extraClassPath=...: Use this when you have a dependency which can't be included in an uber JAR (for example, because there are compile-time conflicts between library versions) and which you need to load at runtime.
      • --conf spark.executor.extraLibraryPath=...: This is passed as the java.library.path option for the JVM. Use this when you need a library path visible to the JVM.
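      To make the addJar/addFile distinction above concrete, here is a minimal sketch of the addFile side: the shipped file ends up on every node but on no classpath, and tasks resolve their local copy through SparkFiles.get (the path is hypothetical):

      import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

      object AddFileSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("add-file-sketch"))

          // Distribute an arbitrary data file to every node (hypothetical path).
          sc.addFile("/opt/data/lookup.csv")

          // Inside a task, resolve the node-local copy by file name.
          val firstLines = sc.parallelize(Seq(1, 2)).map { _ =>
            scala.io.Source.fromFile(SparkFiles.get("lookup.csv")).getLines().next()
          }.collect()

          firstLines.foreach(println)
          sc.stop()
        }
      }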

      Would it be safe to assume that for simplicity, I can add additional application jar files using the 3 main options at the same time:

      You can safely assume this only for Client mode, not Cluster mode, as I've previously said. Also, the example you gave has some redundant arguments. For example, passing JARs to --driver-library-path is useless; you need to pass them to extraClassPath if you want them to be on your classpath. Ultimately, what you want to do when you deploy external JARs on both the driver and the workers is:

      spark-submit --jars additional1.jar,additional2.jar \
        --driver-class-path additional1.jar:additional2.jar \
        --conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
        --class MyClass main-application.jar
      

