What is Lineage in Spark?


Question

How does lineage help to recompute data?

For example, I have several nodes, each computing data for 30 minutes. If one fails after 15 minutes, can lineage recompute the data that was processed in those 15 minutes without spending another 15 minutes?

Recommended answer

Everything you need to understand about lineage is in the definition of an RDD.

So let's review it:

RDDs are an immutable distributed collection of the elements of your data that can be stored in memory or on disk across a cluster of machines. The data is partitioned across the machines in your cluster and can be operated on in parallel through a low-level API that offers transformations and actions. RDDs are fault tolerant because they track data lineage information to rebuild lost data automatically on failure.
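To make the definition above more concrete, here is a minimal sketch in plain Python of what "tracking lineage" means. It is a toy model, not Spark's implementation: the `ToyRDD` class and its methods are hypothetical names invented for illustration. Each object stores only its parent and the function that derives it, so any lost partition can be rebuilt by replaying the chain from the source. In Spark itself, the recorded lineage of an RDD can be inspected with `toDebugString()`.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical): stores HOW to compute data, not the data."""

    def __init__(self, name: str, parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[list], list]] = None,
                 source: Optional[List[list]] = None):
        self.name = name        # label for this step in the lineage
        self.parent = parent    # the dataset this one was derived from
        self.fn = fn            # partition-level function applied to the parent
        self.source = source    # raw partitions, only set on the root

    def map(self, name: str, fn: Callable) -> "ToyRDD":
        # A transformation creates a new immutable node; nothing is computed yet.
        return ToyRDD(name, parent=self, fn=lambda part: [fn(x) for x in part])

    def filter(self, name: str, pred: Callable) -> "ToyRDD":
        return ToyRDD(name, parent=self, fn=lambda part: [x for x in part if pred(x)])

    def lineage(self) -> List[str]:
        # Walk back through the parents to list every step, newest first.
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def compute(self, index: int) -> list:
        # Rebuild ONE partition by replaying the lineage from the source.
        if self.parent is None:
            return list(self.source[index])
        return self.fn(self.parent.compute(index))


source = ToyRDD("source", source=[[1, 2, 3], [4, 5, 6]])
squares = source.map("square", lambda x: x * x)
evens = squares.filter("keep_even", lambda x: x % 2 == 0)

print(evens.lineage())   # ['keep_even', 'square', 'source']
# Pretend partition 1 was lost with its executor: only that partition is rebuilt.
print(evens.compute(1))  # [16, 36]
```

Note that rebuilding partition 1 never touches partition 0; in this sketch, as in the definition above, recovery is per partition and driven entirely by the recorded chain of transformations.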

So there are mainly two things to understand:

How does Spark work internally?

Unfortunately, these topics are too long to discuss in a single answer. I recommend taking some time to read about them, along with the following article about data lineage.

And now, to answer your question and doubts:

If an executor fails while computing your data, after 15 minutes, Spark will go back to your last checkpoint, whether that is the source or data cached in memory and/or on disk.
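The effect of "going back to the last checkpoint" can be sketched with another small toy model in plain Python. Again, this is an illustration under stated assumptions and not Spark code: the `Stage` class, its `persist` method, and the step names are hypothetical. A retried computation walks back through the lineage and stops at the nearest materialized ancestor; if nothing was persisted, it goes all the way back to the source. In Spark, the corresponding calls are `RDD.persist()` or `RDD.cache()` and `RDD.checkpoint()`.

```python
class Stage:
    """Toy lineage node (hypothetical): one step with an optional materialized result."""

    def __init__(self, name, fn, parent=None):
        self.name = name
        self.fn = fn            # source: fn() -> data; otherwise: fn(parent_data) -> data
        self.parent = parent
        self.cached = None      # set once the result has been persisted

    def persist(self, run_log):
        # Materialize this stage so later recomputations can start from here.
        self.cached = self.compute(run_log)
        return self

    def compute(self, run_log):
        # A retry stops walking back as soon as it finds persisted data.
        if self.cached is not None:
            run_log.append(f"read cache of {self.name}")
            return self.cached
        if self.parent is None:
            run_log.append(f"read {self.name}")
            return self.fn()
        data = self.parent.compute(run_log)
        run_log.append(f"run {self.name}")
        return self.fn(data)


source = Stage("source", fn=lambda: [1, 2, 3])
slow = Stage("slow_step", fn=lambda xs: [x * 10 for x in xs], parent=source)
final = Stage("final_step", fn=lambda xs: sum(xs), parent=slow)

# Retry with nothing persisted: every step reruns from the source.
log_without_cache = []
result_a = final.compute(log_without_cache)
print(log_without_cache)  # ['read source', 'run slow_step', 'run final_step']

# Retry after slow_step was persisted: the expensive step is skipped.
slow.persist([])
log_with_cache = []
result_b = final.compute(log_with_cache)
print(log_with_cache)     # ['read cache of slow_step', 'run final_step']
```

Both runs produce the same result; only the amount of repeated work differs. Work that was in progress inside the failed step itself is repeated in either case, because only completed, persisted results can serve as a restart point.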

Thus, it will not save you those 15 minutes that you mentioned!
