How to run spark-submit in virtualenv for pyspark?

Views: 135. Published: 2021/4/8 20:16:53. Tags: apache-spark, pyspark, spark-submit

Problem description

Is there a way to run spark-submit (Spark v2.3.2 from HDP 3.1.0) while in a virtualenv? I have a Python file that uses python3 (and some specific libs) in a virtualenv, to isolate the lib versions from the rest of the system. I would like to run this file with /bin/spark-submit, but attempting to do so I get...

[me@airflowetl tests]$ source ../venv/bin/activate; /bin/spark-submit sparksubmit.test.py 
  File "/bin/hdp-select", line 255
    print "ERROR: Invalid package - " + name
                                    ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
    at org.apache.spark.launcher.Main.main(Main.java:118)
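
The traceback above shows /bin/hdp-select, which uses a Python 2 print statement, being parsed by a Python 3 interpreter. A plausible reading is that activating the virtualenv put python3 first on PATH, so hdp-select failed and hdp.version was never resolved. What follows is a minimal sketch of one thing to try, not a confirmed fix: leave the virtualenv deactivated and pass the interpreter and the HDP version explicitly. The helper names build_submit_env and build_submit_cmd are hypothetical, the venv path is a placeholder, and the version string 3.1.0.0-78 is an assumption taken from the /usr/hdp path that appears in the later log output.

```python
import os
import shlex


def build_submit_env(venv_python, hdp_version, base_env=None):
    """Return a copy of the environment with the Spark/HDP variables set.

    The virtualenv is NOT activated here, so PATH still resolves the
    system interpreter that the Python 2 hdp-select script expects.
    """
    env = dict(os.environ if base_env is None else base_env)
    # HDP_VERSION is the variable named in the IllegalStateException message.
    env["HDP_VERSION"] = hdp_version
    # Standard PySpark variables selecting the worker and driver interpreters.
    env["PYSPARK_PYTHON"] = venv_python
    env["PYSPARK_DRIVER_PYTHON"] = venv_python
    return env


def build_submit_cmd(script, hdp_version, spark_submit="/bin/spark-submit"):
    """Build a spark-submit argument list that also passes -Dhdp.version."""
    java_opt = "-Dhdp.version=" + hdp_version
    return [
        spark_submit,
        "--master", "yarn",
        "--deploy-mode", "client",
        "--conf", "spark.driver.extraJavaOptions=" + java_opt,
        "--conf", "spark.yarn.am.extraJavaOptions=" + java_opt,
        script,
    ]


if __name__ == "__main__":
    # Placeholder path and assumed version; adjust both for the real cluster.
    env = build_submit_env("/path/to/venv/bin/python", "3.1.0.0-78")
    cmd = build_submit_cmd("sparksubmit.test.py", "3.1.0.0-78")
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To actually submit: subprocess.run(cmd, env=env, check=True)
```

On YARN the interpreter named in PYSPARK_PYTHON also has to exist at the same path on the nodes that run the executors, so this sketch only covers the submitting side.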

Also tried...

(venv) [me@airflowetl tests]$ export HADOOP_CONF_DIR=/etc/hadoop/conf; spark-submit --master yarn --deploy-mode cluster sparksubmit.test.py 
19/12/12 13:50:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/12/12 13:50:20 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
    at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
    ....
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig

...or (from here: https://www.hackingnote.com/en/spark/trouble-shooting/NoClassDefFoundError-ClientConfig )...

(venv) [airflow@airflowetl tests]$ spark-submit --master yarn --deploy-mode client --conf spark.hadoop.yarn.timeline-service.enabled=false sparksubmit.test.py 
19/12/12 15:22:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/12/12 15:22:49 INFO spark.SparkContext: Running Spark version 2.4.4
19/12/12 15:22:49 INFO spark.SparkContext: Submitted application: hph_etl_TEST
19/12/12 15:22:49 INFO spark.SecurityManager: Changing view acls to: airflow
19/12/12 15:22:49 INFO spark.SecurityManager: Changing modify acls to: airflow
19/12/12 15:22:49 INFO spark.SecurityManager: Changing view acls groups to: 
19/12/12 15:22:49 INFO spark.SecurityManager: Changing modify acls groups to: 
19/12/12 15:22:49 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(airflow); groups with view permissions: Set(); users  with modify permissions: Set(airflow); groups with modify permissions: Set()
19/12/12 15:22:49 INFO util.Utils: Successfully started service 'sparkDriver' on port 45232.
19/12/12 15:22:50 INFO spark.SparkEnv: Registering MapOutputTracker
19/12/12 15:22:50 INFO spark.SparkEnv: Registering BlockManagerMaster
19/12/12 15:22:50 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/12/12 15:22:50 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/12/12 15:22:50 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-320366b6-609a-497b-ac40-119d11682044
19/12/12 15:22:50 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
19/12/12 15:22:50 INFO spark.SparkEnv: Registering OutputCommitCoordinator
19/12/12 15:22:50 INFO util.log: Logging initialized @2663ms
19/12/12 15:22:50 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
19/12/12 15:22:50 INFO server.Server: Started @2763ms
19/12/12 15:22:50 INFO server.AbstractConnector: Started ServerConnector@50a3c656{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
19/12/12 15:22:50 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@306c15f1{/jobs,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2b566f8d{/jobs/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1b5ef515{/jobs/job,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@59f7a5e2{/jobs/job/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@41c58356{/stages,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2d5f2026{/stages/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@324ca89a{/stages/stage,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6f487c61{/stages/stage/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3897116a{/stages/pool,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@68ab090f{/stages/pool/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@42ea3278{/storage,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6eedf530{/storage/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6e71a5c6{/storage/rdd,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5e222a76{/storage/rdd/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4dc8aa38{/environment,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4c8d82c4{/environment/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2fb15106{/executors,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@608faf1c{/executors/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@689e405f{/executors/threadDump,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@48a5742a{/executors/threadDump/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6db93559{/static,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4d7ed508{/,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5510f12d{/api,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6d87de7{/jobs/job/kill,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@62595660{/stages/stage/kill,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://airflowetl.local:4040
19/12/12 15:22:51 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
19/12/12 15:22:51 INFO client.RMProxy: Connecting to ResourceManager at hw001.local/172.18.4.46:8050
19/12/12 15:22:51 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
19/12/12 15:22:51 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (15360 MB per container)
19/12/12 15:22:51 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
19/12/12 15:22:51 INFO yarn.Client: Setting up container launch context for our AM
19/12/12 15:22:51 INFO yarn.Client: Setting up the launch environment for our AM container
19/12/12 15:22:51 INFO yarn.Client: Preparing resources for our AM container
19/12/12 15:22:51 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/12/12 15:22:53 INFO yarn.Client: Uploading resource file:/tmp/spark-4e600acd-2d34-4271-b01c-25f312906f93/__spark_libs__8368679994314392346.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/__spark_libs__8368679994314392346.zip
19/12/12 15:22:54 INFO yarn.Client: Uploading resource file:/home/airflow/projects/hph_etl_airflow/venv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/pyspark.zip
19/12/12 15:22:55 INFO yarn.Client: Uploading resource file:/home/airflow/projects/hph_etl_airflow/venv/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/py4j-0.10.7-src.zip
19/12/12 15:22:55 INFO yarn.Client: Uploading resource file:/tmp/spark-4e600acd-2d34-4271-b01c-25f312906f93/__spark_conf__5403285055443058510.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/__spark_conf__.zip
19/12/12 15:22:55 INFO spark.SecurityManager: Changing view acls to: airflow
19/12/12 15:22:55 INFO spark.SecurityManager: Changing modify acls to: airflow
19/12/12 15:22:55 INFO spark.SecurityManager: Changing view acls groups to: 
19/12/12 15:22:55 INFO spark.SecurityManager: Changing modify acls groups to: 
19/12/12 15:22:55 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(airflow); groups with view permissions: Set(); users  with modify permissions: Set(airflow); groups with modify permissions: Set()
19/12/12 15:22:56 INFO yarn.Client: Submitting application application_1572898343646_0029 to ResourceManager
19/12/12 15:22:56 INFO impl.YarnClientImpl: Submitted application application_1572898343646_0029
19/12/12 15:22:56 INFO cluster.SchedulerExtensionServices: Starting Yarn extension services with app application_1572898343646_0029 and attemptId None
19/12/12 15:22:57 INFO yarn.Client: Application report for application_1572898343646_0029 (state: ACCEPTED)
19/12/12 15:22:57 INFO yarn.Client: 
     client token: N/A
     diagnostics: AM container is launched, waiting for AM container to Register with RM
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1576200176385
     final status: UNDEFINED
     tracking URL: http://hw001.local:8088/proxy/application_1572898343646_0029/
     user: airflow
19/12/12 15:22:58 INFO yarn.Client: Application report for application_1572898343646_0029 (state: FAILED)
19/12/12 15:22:58 INFO yarn.Client: 
     client token: N/A
     diagnostics: Application application_1572898343646_0029 failed 2 times due to AM Container for appattempt_1572898343646_0029_000002 exited with  exitCode: 1
Failing this attempt.Diagnostics: [2019-12-12 15:22:58.214]Exception from container-launch.
Container id: container_e02_1572898343646_0029_02_000001
Exit code: 1

[2019-12-12 15:22:58.215]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/hadoop/yarn/local/usercache/airflow/appcache/application_1572898343646_0029/container_e02_1572898343646_0029_02_000001/launch_container.sh: line 38: $PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:$HADOOP_CONF_DIR:/usr/hdp/3.1.0.0-78/hadoop/*:/usr/hdp/3.1.0.0-78/hadoop/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__spark_conf__/__hadoop_conf__: bad substitution
....
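
The "bad substitution" message in prelaunch.err comes from bash itself: a variable name cannot contain a dot, so bash rejects `${hdp.version}` when launch_container.sh is executed. In other words, the placeholder reached the container script without being replaced by a real version. The sketch below is only a diagnostic aid under that reading. The function name find_bad_substitutions is hypothetical, and the classpath string is an abridged copy of the one in the log above.

```python
import re

# Matches every ${...} group and captures what is between the braces.
PLACEHOLDER = re.compile(r"\$\{([^}]*)\}")
# A plain shell variable name: letters, digits, underscores, no leading digit.
VALID_SHELL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def find_bad_substitutions(text):
    """Return the distinct ${...} names that are not plain shell identifiers.

    This is a rough check: valid bash expansions with operators, such as
    ${VAR:-default}, would also be reported.
    """
    bad = []
    for name in PLACEHOLDER.findall(text):
        if not VALID_SHELL_NAME.match(name) and name not in bad:
            bad.append(name)
    return bad


# Abridged from the launch_container.sh line quoted in prelaunch.err.
classpath = (
    "$PWD:$PWD/__spark_conf__:$HADOOP_CONF_DIR:"
    "/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:"
    "/etc/hadoop/conf/secure"
)

print(find_bad_substitutions(classpath))  # prints ['hdp.version']
```

If the placeholder shows up here, the value of hdp.version was not available when the classpath was expanded, which is consistent with the earlier "hdp.version is not set" exception.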

I'm not sure what to make of this or how to proceed, and I did not fully understand the error message after googling it.

Does anyone with more experience have any further debugging tips or fixes for this?
