Creating a real-time data warehouse


Problem description

I am working on a personal project that consists of building the full architecture of a data warehouse (DWH). I decided to use Pentaho as the ETL and BI analysis tool; it offers a lot of functionality, from easy dashboard creation to full data mining processes and OLAP cubes.

I have read that a data warehouse must be a relational database, and I understand this. What I don't understand is how to achieve a near-real-time or fully real-time DWH. I have read about push and pull strategies, and my conclusions are the following:


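  • The choice of DBMS is not important for creating a real-time DWH. I mean that it is possible with MySQL, SQL Server, Oracle, or any other. As this is a personal project, I chose MySQL.

  • The key factor is the frequency of job scheduling, which is the scheduler's task. Is this assumption correct? I mean, is the key to creating a real-time DWH to schedule a job every second for each ETL process?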

If I am wrong, can you help me understand this? And then, what is the way to create a real-time DWH? Is there any open source scheduler that allows that? And any non-open-source scheduler that allows that?

I am very confused because some references say that this is impossible and others say that it is possible.

Recommended answer

Definition

Very interesting question. First of all, you should define how "real-time" realtime needs to be. True realtime means very low latency for incoming data, but it requires good architecture in the sending systems, maybe an event bus or a messaging queue, and good infrastructure on the receiving end. This usually involves some kind of listener, with the delivering systems pushing the data.

Near-realtime would be the next "lower" level. If we say near-realtime means a delay of about 5 minutes at most, your approach could work as well. For example, you could pull the data every minute or so. But keep in mind that you need some kind of high-performance check for whether new data is available and which data to fetch. If this check plus the pull takes longer than a minute, it becomes hard to keep up with the data. It really depends on the volume.
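One common way to keep the "is there new data, and which rows?" check cheap is a high-water mark on an indexed, increasing key. The following is only a minimal sketch of that idea, not a statement about how Pentaho does it: the `orders` table, its columns, and the `pull_delta` helper are hypothetical, and an in-memory SQLite database stands in for the operational source so the example runs on its own.

```python
import sqlite3

def pull_delta(source, last_seen_id):
    """Return the rows newer than the watermark, plus the new watermark."""
    rows = source.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    # If nothing new arrived, the watermark stays where it was.
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark

# In-memory SQLite stands in for the source system (hypothetical schema).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (20.0,)])

# First run: everything is new.
rows, watermark = pull_delta(source, 0)

# A row arrives between runs; the second run fetches only that row.
source.execute("INSERT INTO orders (amount) VALUES (30.0)")
new_rows, watermark = pull_delta(source, watermark)
```

Because the filter runs on the primary key, each poll touches only the new rows, which is what keeps a once-a-minute pull feasible as the table grows. The watermark would need to be stored somewhere durable between runs.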

Realtime

As I said before, realtime analytics ideally requires a messaging queue or a service bus that some of your jobs can connect to and "listen" on for new data. When a new data package is pushed into the pipeline, it will probably be very small and can be processed very fast.
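To illustrate the push-and-listen shape described above, here is a minimal sketch using only the Python standard library. It is an assumption-laden stand-in: `queue.Queue` plays the role of a real message queue or service bus, the `warehouse` list plays the role of a DWH table, and the event fields are hypothetical.

```python
import queue
import threading

events = queue.Queue()   # stand-in for a real message queue or service bus
warehouse = []           # stand-in for a DWH fact table
STOP = object()          # sentinel that tells the listener to shut down

def listen(q, sink):
    """Block until an event arrives, load it, and repeat until STOP is seen."""
    while True:
        event = q.get()
        if event is STOP:
            break
        # Each package is small, so the per-event load is quick.
        sink.append({"order_id": event["order_id"], "amount": event["amount"]})

listener = threading.Thread(target=listen, args=(events, warehouse))
listener.start()

# The sending system pushes small packages as they occur.
events.put({"order_id": 1, "amount": 10.0})
events.put({"order_id": 2, "amount": 20.0})
events.put(STOP)
listener.join()
```

The point of the design is that nothing polls: the listener sleeps inside `q.get()` and wakes only when the sender pushes something, so latency is bounded by processing time rather than by a schedule.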

If there is no infrastructure for listeners, you need to go near-realtime.

Near-realtime

This is the part where you have to develop more yourself. You have to make sure you get relatively small data packages, which will usually be some kind of delta. This could be done with triggers if you have access to the database. Otherwise you have to pull every once in a while, where your "once" will probably be very frequent.
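The trigger approach can be sketched as follows. This is a hedged illustration, not MySQL syntax: SQLite (via Python's `sqlite3` module) stands in for the source database so the example is self-contained, and the `customers` and `customers_delta` tables and the trigger names are hypothetical. In MySQL the trigger bodies would be written with `FOR EACH ROW`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

-- Change log that the ETL job reads instead of scanning the full table.
CREATE TABLE customers_delta (
    change_id   INTEGER PRIMARY KEY,
    customer_id INTEGER,
    op          TEXT
);

CREATE TRIGGER customers_ins AFTER INSERT ON customers
BEGIN
    INSERT INTO customers_delta (customer_id, op) VALUES (NEW.id, 'I');
END;

CREATE TRIGGER customers_upd AFTER UPDATE ON customers
BEGIN
    INSERT INTO customers_delta (customer_id, op) VALUES (NEW.id, 'U');
END;
""")

# Normal application writes; the triggers record each change as a side effect.
db.execute("INSERT INTO customers (name) VALUES ('Ada')")
db.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")

# The ETL job picks up only the recorded changes, in order.
delta = db.execute(
    "SELECT customer_id, op FROM customers_delta ORDER BY change_id"
).fetchall()
```

After a successful load, the job would delete or mark the processed `customers_delta` rows so the next run sees only newer changes.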

This could be done on Linux, for example, with a simple cron job, or on Windows with the Task Scheduler. Just keep in mind that your loading and processing time shouldn't exceed the time window you have until the next job starts.
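If a run does occasionally exceed its window, one simple safeguard is to make the next run skip itself instead of overlapping. The sketch below uses an exclusive lock file for that; the `run_exclusive` helper and the lock path are hypothetical names chosen for illustration, and a crashed run would leave a stale lock that has to be cleaned up by hand.

```python
import os
import tempfile

# Hypothetical lock location; any path shared by all runs of the job works.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "dwh_load.lock")

def run_exclusive(job, lock_path=LOCK_PATH):
    """Run job() unless a previous run still holds the lock file.

    Returns True if the job ran, False if it was skipped.
    """
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    try:
        job()
        return True
    finally:
        # Release the lock even if the job raised.
        os.close(fd)
        os.remove(lock_path)
```

A script started by cron every minute would wrap its load step in `run_exclusive(load)`; a skipped run is then a visible signal that the load no longer fits its time window.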

Database

In the end, once you have defined what you want to achieve and have a general idea of how to implement delta loading or listeners, you are right: you could use a relational database. If you are interested in performance and are modelling this part as a star schema, you could also look into column-oriented engines or databases. Apache Cassandra is often mentioned in this context, although strictly speaking it is a wide-column store rather than a columnar analytics engine.

Scheduling

For job scheduling you could also start with the standard Linux or Windows scheduling tools. If you code in Java, you could later use something like Quartz. But this only applies to near-realtime. Realtime requires a different architecture, as I explained above.
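To make the near-realtime scheduling idea concrete, here is a minimal fixed-interval loop in plain Python. It is only a sketch of what cron or a library scheduler does for you, not the Quartz API: `run_every` and its parameters are hypothetical, and the `clock` and `sleep` arguments exist only so the loop can be exercised without real waiting.

```python
import time

def run_every(interval_s, job, runs, clock=time.monotonic, sleep=time.sleep):
    """Call job() `runs` times, aiming for one start every interval_s seconds.

    Returns the number of runs that took longer than their interval.
    """
    overruns = 0
    next_start = clock()
    for _ in range(runs):
        job()
        next_start += interval_s
        remaining = next_start - clock()
        if remaining > 0:
            sleep(remaining)       # wait out the rest of the time window
        else:
            overruns += 1          # the job took longer than its time window
            next_start = clock()   # re-anchor instead of piling up late runs
    return overruns
```

Calling `run_every(60, load_delta, runs=10)` with a hypothetical `load_delta` function would start a load roughly once a minute; a non-zero return value means the load did not fit its window and the interval or the delta size needs to change.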
