How does Apache Spark handle system failure when deployed in YARN?


Question

Preconditions

Let's assume Apache Spark is deployed on a Hadoop cluster using YARN, and a Spark job is currently executing. How does Spark handle the situations listed below?

Cases & Questions

  1. One node of the Hadoop cluster fails due to a disk error. However, replication is high enough and no data was lost.
    • What will happen to tasks that were running on that node?
  2. A node fails as above, but this time data was lost.
    • How will Spark handle this situation?
  3. The active namenode fails during execution.
    • Did Spark automatically use the failover namenode?
    • What happens when the secondary namenode fails as well?
  4. The whole cluster goes down during a workflow.
    • Will Spark restart with the cluster automatically?
    • Will it resume from the last "save" point in the workflow?

I know some of these questions might sound odd. Anyway, I hope you can answer some or all of them. Thanks in advance. :)

Answer

Here are the answers given on the mailing list (answers were provided by Sandy Ryza of Cloudera):

  1. "Spark将在其他节点上重新运行这些任务."
  2. 在许多失败的任务尝试读取该块之后,Spark会放弃HDFS返回的任何错误,从而使作业失败."
  3. "Spark通过普通的HDFS客户端API访问HDFS.在HA配置下,这些将自动故障转移到新的namenode.如果没有剩余namenode,则Spark作业将失败."
  4. 重新启动是管理的一部分,并且"Spark支持对HDFS的检查点,因此您可以返回到上次被称为HDFS的检查点的时间."
  1. "Spark will rerun those tasks on a different node."
  2. "After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
  3. "Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
  4. Restart is part of administration and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."
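
Answers 1 and 2 both hinge on Spark's task-retry logic. Below is a minimal sketch of the relevant settings, assuming Spark on YARN; spark.task.maxFailures and spark.yarn.maxAppAttempts are real Spark properties, but the values and the object name are illustrative only.

```scala
import org.apache.spark.SparkConf

// Sketch of the retry-related settings behind answers 1 and 2.
// Values are examples, not recommendations.
object RetrySettingsSketch {
  val conf: SparkConf = new SparkConf()
    .setAppName("retry-settings-sketch") // hypothetical app name
    // How many attempts a single task gets (on different executors)
    // before the whole job is failed; Spark's default is 4.
    .set("spark.task.maxFailures", "8")
    // Under YARN, how many times the application itself is retried
    // if its application master dies.
    .set("spark.yarn.maxAppAttempts", "2")
}
```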

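To illustrate the checkpointing mentioned in answer 4, here is a minimal RDD-checkpointing sketch; setCheckpointDir and checkpoint() are standard Spark APIs, but the HDFS paths and names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    // Runs under YARN when submitted with `--master yarn`.
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))

    // Checkpoints are written to HDFS, so they survive executor and
    // node failures (hypothetical path).
    sc.setCheckpointDir("hdfs:///user/spark/checkpoints")

    val counts = sc.textFile("hdfs:///data/input.txt") // hypothetical input
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    // Truncates the lineage: recovery can restart from the
    // materialized copy on HDFS instead of recomputing from scratch.
    counts.checkpoint()
    counts.count() // an action forces evaluation and writes the checkpoint

    sc.stop()
  }
}
```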
