Lambda Architecture - Why a batch layer?

Question

I am going through the lambda architecture and trying to understand how it can be used to build fault-tolerant big data systems.

I am wondering how the batch layer is useful when everything could be stored in the realtime view and the results generated from it. Is it because realtime storage cannot hold all of the data? If it did, it would no longer be realtime, since the time taken to retrieve data depends on how much data is stored.

Recommended answer

Why a batch layer

To save time and money!

It basically has two functions:

  • Manage the master dataset (assumed to be immutable)
  • Precompute batch views for ad-hoc queries
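
The two functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `PageView`, `MasterDataset`, and `build_batch_view` are hypothetical names, and a real batch layer would keep the master dataset in a distributed store and recompute views with a batch framework rather than in memory.

```python
from collections import Counter
from typing import NamedTuple


class PageView(NamedTuple):
    """Hypothetical immutable event record (a NamedTuple cannot be mutated)."""
    url: str
    timestamp: int


class MasterDataset:
    """Hypothetical append-only store: events are added, never updated or deleted."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def events(self):
        # Hand out a tuple so callers cannot modify the stored history.
        return tuple(self._events)


def build_batch_view(events):
    """Recompute the batch view from scratch: page-view counts per URL."""
    return Counter(event.url for event in events)


master = MasterDataset()
for url, ts in [("/home", 1), ("/docs", 2), ("/home", 3)]:
    master.append(PageView(url, ts))

# The expensive scan over all events happens once per batch run, not once per query.
batch_view = build_batch_view(master.events())
```

Because the view is always rebuilt from the immutable events, a bug in the view logic can be fixed by simply recomputing it.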

Everything can be stored in the realtime view and the results generated from it: NOT TRUE

The above is certainly possible, but not feasible: the data could run to hundreds or thousands of petabytes, and generating results from it could take time, a lot of time!

The key here is to attain low-latency queries over a large dataset. The batch layer creates batch views (precomputed results that can be served with low latency), and the realtime layer covers recent or updated data, which is usually small. Any ad-hoc query can then be answered by merging results from the batch views and the realtime views instead of computing over the entire master dataset.
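
As a rough sketch of that merge step, assume both views hold page-view counts keyed by URL. The view contents and the function name `query_page_views` are made up for illustration, and the sketch assumes the realtime view only holds events that arrived after the last batch run, so nothing is counted twice.

```python
from collections import Counter

# Batch view: precomputed over the whole master dataset up to the last batch run.
batch_view = Counter({"/home": 1_000_000, "/docs": 250_000})

# Realtime view: small, covers only events that arrived since that batch run.
realtime_view = Counter({"/home": 42, "/pricing": 7})


def query_page_views(url, batch, realtime):
    """Answer a query by merging the two views; no scan of the master dataset."""
    return batch.get(url, 0) + realtime.get(url, 0)


total_home = query_page_views("/home", batch_view, realtime_view)
```

Each query costs two key lookups regardless of how large the master dataset is, which is where the low latency comes from.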

Also, think of a query (the same query, even) running again and again over a huge dataset: a waste of time and money!
