Getting error in Spark: Executor lost

Question

I have one master and two slaves, each running with 32 GB of RAM, and I'm reading a csv file with around 18 million records (the first row contains the column headers).

This is the command I am using to run the job

./spark-submit --master yarn --deploy-mode client --executor-memory 10g <path/to/.py file>

I did the following:

rdd = sc.textFile("<path/to/file>")
h = rdd.first()                              # the header line
header_rdd = rdd.filter(lambda l: h in l)    # lines matching the header
data_rdd = rdd.subtract(header_rdd)          # everything except the header
data_rdd.first()

I'm getting the following error message -

15/10/12 13:52:03 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 192.168.1.114:51058
15/10/12 13:52:03 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 192.168.1.114:51058
15/10/12 13:52:03 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@192.168.1.114:51058] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/10/12 13:52:03 ERROR cluster.YarnScheduler: Lost executor 1 on hslave2: remote Rpc client disassociated
15/10/12 13:52:03 INFO scheduler.TaskSetManager: Re-queueing tasks for 1 from TaskSet 3.0
15/10/12 13:52:03 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@hslave2:58555] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/10/12 13:52:03 WARN scheduler.TaskSetManager: Lost task 6.6 in stage 3.0 (TID 208, hslave2): ExecutorLostFailure (executor 1 lost)

This error was coming up while rdd.subtract() was running. So I modified the code, removed the rdd.subtract(), and replaced it with a rdd.filter().

Modified code ->

rdd = sc.textFile("<path/to/file>")
h = rdd.first()                               # the header line
data_rdd = rdd.filter(lambda l: h not in l)   # keep only non-header lines

But I got the same error.
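
As an aside, filtering on the header string scans every line and can also drop a data line that happens to contain the header text. A common alternative (a sketch, not from the original post; it assumes `rdd` is the text-file RDD from above) is to drop only the first line of the first partition:

```python
# Hypothetical helper (not from the original post): remove the header by
# skipping the first element of partition 0 only.
def drop_header(partition_index, lines):
    it = iter(lines)
    if partition_index == 0:
        next(it, None)  # discard the header line
    return it

# Pure-Python illustration of the per-partition behaviour:
part0 = list(drop_header(0, ["id,name", "1,foo", "2,bar"]))
part1 = list(drop_header(1, ["3,baz"]))
# With Spark this would be applied as:
#   data_rdd = rdd.mapPartitionsWithIndex(drop_header)
```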

Does anyone know the reasons behind the executor getting lost?

Is it because of inadequate memory in the machines running the cluster?

Answer

This isn't a Spark bug per se, but is probably related to your settings for Java, YARN, and your Spark config file.

See http://apache-spark-user-list.1001560.n3.nabble.com/Executor-Lost-Failure-td18486.html

You'll want to increase your Java memory, increase your Akka frame size, increase the Akka timeout settings, etc.

Try the following spark.conf:

spark.master                       yarn-cluster
spark.yarn.historyServer.address   <your cluster url>
spark.eventLog.enabled             true
spark.eventLog.dir                 hdfs://<your history directory>
spark.driver.extraJavaOptions      -Xmx20480m -XX:MaxPermSize=2048m -XX:ReservedCodeCacheSize=2048m
spark.checkpointDir                hdfs://<your checkpoint directory>
yarn.log-aggregation-enable        true
spark.shuffle.service.enabled      true
spark.shuffle.service.port         7337
spark.shuffle.consolidateFiles     true
spark.sql.parquet.binaryAsString   true
spark.speculation                  false
spark.yarn.maxAppAttempts          1
spark.akka.askTimeout              1000
spark.akka.timeout                 1000
spark.akka.frameSize               1000
spark.rdd.compress                 true
spark.storage.memoryFraction       1
spark.core.connection.ack.wait.timeout  600
spark.driver.maxResultSize         0
spark.task.maxFailures             20
spark.shuffle.io.maxRetries        20
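
If editing spark.conf isn't convenient, the same settings can also be passed per job as `--conf` flags to spark-submit (a sketch; the values simply mirror the file above):

```shell
./spark-submit --master yarn --deploy-mode client \
  --conf spark.akka.frameSize=1000 \
  --conf spark.akka.askTimeout=1000 \
  --conf spark.akka.timeout=1000 \
  --conf spark.core.connection.ack.wait.timeout=600 \
  --conf spark.task.maxFailures=20 \
  --executor-memory 10g <path/to/.py file>
```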

You might also want to play around with how many partitions you are requesting inside your Spark program, and you may want to add some partitionBy(partitioner) statements to your RDDs, so your code might look like this:

# Note: in PySpark, partitionBy() applies only to key-value RDDs, so for a
# plain text file the partition count is set via minPartitions (or repartition):
rdd = sc.textFile("<path/to/file>", minPartitions=<your number of partitions>)
h = rdd.first()                              # the header line
header_rdd = rdd.filter(lambda l: h in l)
data_rdd = rdd.subtract(header_rdd)
data_rdd.first()
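
How many partitions to ask for depends on the workload; one common rule of thumb (an assumption on my part, not from the answer above) is to target roughly 128 MB of input per partition:

```python
# Rough sketch: choose a partition count so each partition holds about
# target_mb of input. The 128 MB default is an assumed rule of thumb,
# not a Spark setting.
def suggest_partitions(file_size_bytes, target_mb=128):
    target = target_mb * 1024 * 1024
    return max(1, -(-file_size_bytes // target))  # ceiling division

n = suggest_partitions(4 * 1024**3)  # a ~4 GB csv -> 32 partitions
```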

Finally, you may need to play around with your spark-submit command and add parameters for the number of executors, executor memory, and driver memory:

./spark-submit --master yarn --deploy-mode client --num-executors 100 --driver-memory 20G --executor-memory 10g <path/to/.py file>
