Python worker failed to connect back


Question


I'm new to Spark and am trying to complete a Spark tutorial: link to tutorial

After installing it on my local machine (Win10 64-bit, Python 3, Spark 2.4.0) and setting all the environment variables (HADOOP_HOME, SPARK_HOME, etc.), I'm trying to run a simple Spark job via the WordCount.py file:

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Run locally with two worker threads.
    conf = SparkConf().setAppName("word count").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    # Read the input file, split each line into words and count occurrences.
    lines = sc.textFile("C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/in/word_count.text")
    words = lines.flatMap(lambda line: line.split(" "))
    wordCounts = words.countByValue()

    for word, count in wordCounts.items():
        print("{} : {}".format(word, count))

After running it from the terminal:

spark-submit WordCount.py

I get the error below. I checked (by commenting out lines one by one) that it crashes at

wordCounts = words.countByValue()

Any idea what I should check to make it work?

Traceback (most recent call last):
  File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\mjdbr\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 25, in <module>
ModuleNotFoundError: No module named 'resource'
18/11/10 23:16:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
        at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
        at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
        at java.net.PlainSocketImpl.accept(Unknown Source)
        at java.net.ServerSocket.implAccept(Unknown Source)
        at java.net.ServerSocket.accept(Unknown Source)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
        ... 14 more
18/11/10 23:16:58 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "C:/Users/mjdbr/Documents/BigData/python-spark-tutorial/rdd/WordCount.py", line 19, in <module>
    wordCounts = words.countByValue()
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 1261, in countByValue
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 844, in reduce
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 816, in collect
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
        at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
        at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
        at java.net.PlainSocketImpl.accept(Unknown Source)
        at java.net.ServerSocket.implAccept(Unknown Source)
        at java.net.ServerSocket.accept(Unknown Source)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
        ... 14 more

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:170)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:97)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
        at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        ... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
        at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
        at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
        at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
        at java.net.PlainSocketImpl.accept(Unknown Source)
        at java.net.ServerSocket.implAccept(Unknown Source)
        at java.net.ServerSocket.accept(Unknown Source)
        at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:164)
        ... 14 more

As suggested by theplatypus - I checked whether the 'resource' module can be imported directly from the terminal - apparently not:

>>> import resource
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'resource'
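
The same thing can be checked with the standard library alone (a minimal sketch, no Spark involved):

# importlib.util.find_spec returns None when a top-level module cannot be
# found, mirroring the failed 'import resource' above without raising.
import importlib.util

print(importlib.util.find_spec("resource"))  # prints None here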

In terms of installing resources, I followed the instructions from this tutorial:

  1. downloaded spark-2.4.0-bin-hadoop2.7.tgz from the Apache Spark website
  2. unzipped it to my C: drive
  3. already had Python 3 (Anaconda distribution) and Java installed
  4. created a local 'C:\hadoop\bin' folder to store winutils.exe
  5. created a 'C:\tmp\hive' folder and gave Spark access to it
  6. added environment variables (SPARK_HOME, HADOOP_HOME, etc.); a quick way to double-check them is sketched right after this list
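
A minimal sketch for that double-check (JAVA_HOME and PYSPARK_PYTHON are extra variables worth inspecting, not ones required by the steps above):

# Print the environment variables the setup above relies on, so that a typo
# or a missing entry can be ruled out before touching Spark itself.
import os

for var in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME", "PYSPARK_PYTHON"):
    print(var, "=", os.environ.get(var))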

Is there any extra resource I should install?

Solution

I got the same error. I solved it by installing a previous version of Spark (2.3 instead of 2.4). Now it works perfectly; maybe it is an issue with the latest version of PySpark.
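
One quick way to confirm that the downgrade actually took effect (a small sketch, assuming pyspark is importable from the same interpreter that spark-submit uses) is to print the version:

# After installing Spark 2.3.x, this is expected to print a 2.3.x version
# string; a 2.4.x string here means the old installation is still on the path.
import pyspark

print(pyspark.__version__)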
