Can't get pyspark job to run on all nodes of hadoop cluster

Problem description

Summary: I can't get my Python Spark job to run on all nodes of my Hadoop cluster. I've installed Spark for Hadoop, 'spark-1.5.2-bin-hadoop2.6'. When launching a Java Spark job, the load gets distributed over all nodes; when launching a Python Spark job, only one node takes the load.

Setup:

  • hdfs and yarn configured for 4 nodes: nk01 (namenode), nk02, nk03, nk04, running on xen virtual servers
  • versions: jdk1.8.0_66, hadoop-2.7.1, spark-1.5.2-bin-hadoop2.6
  • hadoop installed on all 4 nodes
  • spark only installed on nk01

I copied a bunch of Gutenberg files (thank you, Johannes!) onto hdfs, and tried doing a wordcount using Java and Python on a subset of the files (the files that start with an 'e'):

Python:

Using a homebrew python script for doing wordcount:

/opt/spark/bin/spark-submit wordcount.py --master yarn-cluster \
    --num-executors 4 --executor-cores 1

The Python code assigns 4 partitions:

tt=sc.textFile('/user/me/gutenberg/text/e*.txt',4)
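
For reference, a minimal version of such a wordcount.py might look like the following (a sketch of the standard RDD word-count pattern; the actual script is not shown in the question, and the app name and the sampling at the end are illustrative):

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

# Read the Gutenberg subset into 4 partitions, as above.
tt = sc.textFile('/user/me/gutenberg/text/e*.txt', 4)

# Classic word count: split lines into words, pair each with 1, sum per word.
counts = (tt.flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))

# Bring a small sample back to the driver just to have some output.
print(counts.take(10))

sc.stop()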

Load on the 4 nodes during 60 seconds:

Java:

Using the JavaWordCount found in the spark distribution:

/opt/spark/bin/spark-submit --class JavaWordCount --master yarn-cluster \
    --num-executors 4 jwc.jar '/user/me/gutenberg/text/e*.txt'

Conclusion: the Java version distributes its load across the cluster; the Python version just runs on 1 node.

Question: how do I get the python version also to distribute the load across all nodes?

Solution

Spark-submit

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

The difference from the Scala/Java submit lies in the parameter position.

For Python applications, simply pass a .py file in the place of application-jar instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

You should use the command below instead:

/opt/spark/bin/spark-submit --master yarn-cluster wordcount.py \
    --num-executors 4 --executor-cores 1
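
Why the ordering matters: spark-submit stops reading its own options at the application file, so with the original command everything after wordcount.py was handed to the script as plain arguments and the --master and executor settings were never applied. A quick way to see this from inside the script (an illustrative sketch, not part of the original answer):

import sys
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

# With the mis-ordered submit, the flags end up here instead of in spark-submit:
print("script arguments: %s" % sys.argv[1:])   # e.g. ['--master', 'yarn-cluster', ...]
print("effective master: %s" % sc.master)      # not 'yarn-cluster' in that case

sc.stop()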
