pyspark: ship jar dependency with spark-submit


Question


I wrote a pyspark script that reads two JSON files, coGroups them and sends the result to an Elasticsearch cluster. Everything works (mostly) as expected when I run it locally: I downloaded the elasticsearch-hadoop jar file for the org.elasticsearch.hadoop.mr.EsOutputFormat and org.elasticsearch.hadoop.mr.LinkedMapWritable classes, ran my job with pyspark using the --jars argument, and I can see documents appearing in my Elasticsearch cluster.
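For context, here is a minimal sketch of the kind of write the script performs, following the commonly documented elasticsearch-hadoop pattern for saveAsNewAPIHadoopFile; the file paths, key, index name and ES host below are placeholders, not taken from the original script:

from pyspark import SparkContext

sc = SparkContext(appName="es-write-sketch")

# Placeholder inputs: two JSON files keyed by some shared id
left = sc.textFile("/data/left.json").map(lambda line: ("some-key", line))
right = sc.textFile("/data/right.json").map(lambda line: ("some-key", line))

# coGroup the two RDDs and build (key, dict) pairs; the dict becomes the ES document
docs = left.cogroup(right).map(
    lambda kv: (kv[0], {"left": list(kv[1][0]), "right": list(kv[1][1])})
)

# Placeholder Elasticsearch connection settings
es_write_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "myindex/mytype",
}

# This call needs EsOutputFormat and LinkedMapWritable on the JVM classpath
docs.saveAsNewAPIHadoopFile(
    path="-",  # ignored by EsOutputFormat
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf,
)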


When I try to run it on a spark cluster, however, I'm getting this error:

Traceback (most recent call last):
  File "/root/spark/spark_test.py", line 141, in <module>
    conf=es_write_conf
  File "/root/spark/python/pyspark/rdd.py", line 1302, in saveAsNewAPIHadoopFile
    keyConverter, valueConverter, jconf)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: java.lang.ClassNotFoundException: org.elasticsearch.hadoop.mr.LinkedMapWritable
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:274)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:157)
    at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1$$anonfun$apply$9.apply(PythonRDD.scala:611)
    at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1$$anonfun$apply$9.apply(PythonRDD.scala:610)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1.apply(PythonRDD.scala:610)
    at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1.apply(PythonRDD.scala:609)
    at scala.Option.flatMap(Option.scala:170)
    at org.apache.spark.api.python.PythonRDD$.getKeyValueTypes(PythonRDD.scala:609)
    at org.apache.spark.api.python.PythonRDD$.saveAsNewAPIHadoopFile(PythonRDD.scala:701)
    at org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)


which seems pretty clear to me: the elasticsearch-hadoop jar is not available on the workers. So the question: how do I ship it along with my app? I could use sc.addPyFile for a Python dependency, but that won't work with jars, and using the --jars parameter of spark-submit doesn't help.
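For reference, a minimal sketch of the sc.addPyFile approach mentioned above (the path is a placeholder): it distributes a Python file, zip or egg to the executors, but it does not add anything to the JVM classpath, which is why it cannot ship a jar.

from pyspark import SparkContext

sc = SparkContext(appName="addpyfile-sketch")
# Ships a .py/.zip/.egg to every executor for Python imports only;
# jars needed on the JVM classpath must be passed via --jars instead.
sc.addPyFile("/path/to/python_dependency.zip")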

Answer


The --jars option just works; the problem was how I was running the spark-submit job in the first place. The correct way to execute it is:

./bin/spark-submit <options> scriptname


Therefore the --jars option must be placed before the script:

./bin/spark-submit --jars /path/to/my.jar myscript.py


This is obvious once you consider that this is the only way to pass arguments to the script itself, since everything after the script name is used as an input argument to the script:

./bin/spark-submit --jars /path/to/my.jar myscript.py --do-magic=true
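To illustrate that rule, a small sketch of how such an argument surfaces inside the script itself; --do-magic is just the hypothetical flag from the example above:

import sys

# spark-submit's own options (such as --jars) must come before the script name;
# anything placed after the script name lands in the script's sys.argv.
if __name__ == "__main__":
    print(sys.argv)  # e.g. ['/path/to/myscript.py', '--do-magic=true']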
