Python worker failed to connect back

Problem description
I'm a newbie with Spark, trying to complete a Spark tutorial: link to tutorial
After installing it on my local machine (Win10 64-bit, Python 3, Spark 2.4.0) and setting all the environment variables (HADOOP_HOME, SPARK_HOME, etc.), I'm trying to run a simple Spark job via a WordCount.py file:
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Run locally with two worker threads
    conf = SparkConf().setAppName("word count").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    # Count occurrences of each whitespace-separated word
    lines = sc.textFile("C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/in/word_count.text")
    words = lines.flatMap(lambda line: line.split(" "))
    wordCounts = words.countByValue()

    for word, count in wordCounts.items():
        print("{} : {}".format(word, count))
After running it from the terminal:
spark-submit WordCount.py
I get the error below. I checked (by commenting out lines one by one) that it crashes at
wordCounts = words.countByValue()
Any idea what I should check to make it work?
Traceback (most recent call last):
File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 25, in <module>
ModuleNotFoundError: No module named 'resource'
18/11/10 23:16:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
... 14 more
18/11/10 23:16:58 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/rdd/WordCount.py", line 19, in <module>
wordCounts = words.countByValue()
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 1261, in countByValue
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 844, in reduce
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 816, in collect
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
... 14 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
at java.net.PlainSocketImpl.accept(Unknown Source)
at java.net.ServerSocket.implAccept(Unknown Source)
at java.net.ServerSocket.accept(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
... 14 more
As suggested by theplatypus, I checked whether the 'resource' module can be imported directly from the terminal - apparently not:
>>> import resource
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'resource'
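For context: 'resource' is a Unix-only module in the Python standard library (its documentation lists availability as Unix), so this import is expected to fail on Windows and the module cannot be installed separately. The traceback above shows Spark 2.4.0's worker.py importing it at module level, which is why the worker process dies before it can connect back to the JVM. A minimal sketch of the kind of platform guard that sidesteps the failing import - purely an illustration of the idea, not Spark's actual worker code:

import sys

try:
    import resource  # Unix-only stdlib module; raises ImportError on Windows
    has_resource = True
except ImportError:
    has_resource = False

print(sys.platform, "- has 'resource' module:", has_resource)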
In terms of installation resources, I followed the instructions from this tutorial:
- downloaded spark-2.4.0-bin-hadoop2.7.tgz from the Apache Spark website
- unzipped it to my C: drive
- already had Python 3 (Anaconda distribution) installed, as well as Java
- created a local 'C:\hadoop\bin' folder to store winutils.exe
- created a 'C:\tmp\hive' folder and gave Spark access to it
- added the environment variables (SPARK_HOME, HADOOP_HOME, etc.); a quick sanity check for these is sketched below
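As a sanity check of the setup above, a short sketch that prints what the Spark launcher will actually see (run it with the interpreter you expect Spark to use; PYSPARK_PYTHON is Spark's documented variable for selecting the worker's Python, listed here just as something worth inspecting):

import os
import sys

# Interpreter this driver-side script runs under
print("driver python:", sys.executable)

# Environment variables the Spark launcher reads; None means unset
for var in ("SPARK_HOME", "HADOOP_HOME", "PYSPARK_PYTHON"):
    print(var, "=", os.environ.get(var))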
Is there any extra resource I should install?
I got the same error. I solved it by installing the previous version of Spark (2.3 instead of 2.4). Now it works perfectly; maybe it is an issue with the latest version of PySpark.
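For what it's worth, after downgrading you can confirm which version the driver actually picks up (assuming pyspark is importable from your interpreter):

import pyspark

print(pyspark.__version__)  # expect a 2.3.x version string after the downgrade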