Strategies for populating a Reporting/Data Warehouse database


Question

For our reporting application, we have a process that aggregates several databases into a single 'reporting' database on a nightly basis. The schema of the reporting database is quite different from that of the separate 'production' databases that we are aggregating, so a good amount of business logic goes into how the data is aggregated.

Right now this process is implemented by several stored procedures that run nightly. As we add more details to the reporting database, the logic in the stored procedures keeps growing more fragile and unmanageable.

What are some other strategies that could be used to populate this reporting database?


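  • SSIS? This has been considered, but it doesn't appear to offer a cleaner, more maintainable approach than plain stored procedures.

  • A separate C# (or whatever language) process that aggregates the data in memory and then pushes it into the reporting database? This would allow us to write unit tests for the logic and organize the code in a more maintainable way.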
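To make the second option concrete, here is a minimal sketch of what testable aggregation logic could look like. It is written in Python rather than C#, and the function name, row shape and sample values are all hypothetical; the point is only that the business rule lives in a plain function that can be tested without a database.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical row shape: (source database name, day, amount in cents).
Row = Tuple[str, str, int]


def aggregate_daily_totals(rows: Iterable[Row]) -> Dict[str, int]:
    """Sum amounts per day across all source databases."""
    totals: Dict[str, int] = defaultdict(int)
    for _source, day, amount_cents in rows:
        totals[day] += amount_cents
    return dict(totals)


# Plain test functions; a runner such as pytest could collect them as-is.
def test_merges_rows_from_several_sources() -> None:
    rows = [
        ("shop_a", "2024-01-01", 500),
        ("shop_b", "2024-01-01", 250),
        ("shop_a", "2024-01-02", 100),
    ]
    expected = {"2024-01-01": 750, "2024-01-02": 100}
    assert aggregate_daily_totals(rows) == expected


def test_empty_input_gives_empty_result() -> None:
    assert aggregate_daily_totals([]) == {}
```

The database reads and writes would sit around this function, so only the thin I/O layer needs an integration test.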

I'm looking for any new ideas or additional thoughts on the above. Thanks!

Recommended answer

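Our general process is:


  1. Copy data from the source tables into tables in a loading database

  2. Transform the data into staging tables that have the same structure as the final fact/dimension tables

  3. Copy the data from the staging tables to the fact/dimension tables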
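The three steps above can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 module and a single in-memory database with table-name prefixes (src_, load_, stage_, fact_) standing in for the separate source, loading and warehouse databases. All table and column names are made up.

```python
import sqlite3


def run_load(conn: sqlite3.Connection) -> None:
    """Run the three load steps in order inside one transaction."""
    cur = conn.cursor()

    # Step 1: more or less a 1:1 copy from the source into the loading area.
    cur.execute("DELETE FROM load_orders")
    cur.execute(
        "INSERT INTO load_orders "
        "SELECT order_id, customer, amount_cents FROM src_orders"
    )

    # Step 2: transform into a staging table shaped like the fact table.
    cur.execute("DELETE FROM stage_fact_sales")
    cur.execute(
        """
        INSERT INTO stage_fact_sales (customer, order_count, total_amount)
        SELECT UPPER(TRIM(customer)), COUNT(*), SUM(amount_cents) / 100.0
        FROM load_orders
        GROUP BY UPPER(TRIM(customer))
        """
    )

    # Step 3: plain copy from staging into the fact table.
    cur.execute(
        "INSERT INTO fact_sales "
        "SELECT customer, order_count, total_amount FROM stage_fact_sales"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE src_orders (order_id INTEGER, customer TEXT, amount_cents INTEGER);
    CREATE TABLE load_orders (order_id INTEGER, customer TEXT, amount_cents INTEGER);
    CREATE TABLE stage_fact_sales (customer TEXT, order_count INTEGER, total_amount REAL);
    CREATE TABLE fact_sales (customer TEXT, order_count INTEGER, total_amount REAL);
    INSERT INTO src_orders VALUES (1, ' acme ', 1250), (2, 'Acme', 750), (3, 'globex', 500);
    """
)
run_load(conn)
rows = conn.execute(
    "SELECT customer, order_count, total_amount FROM fact_sales ORDER BY customer"
).fetchall()
print(rows)
```

Keeping the staging table identical in shape to the fact table is what lets the last step stay a trivial copy.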

SSIS is good for step 1, which is more or less a 1:1 copy process, with some basic data type mappings and string transformations.

For step 2, we use a mix of stored procs, .NET and Python. Most of the logic is in procedures, with things like heavy parsing in external code. The major benefit of pure TSQL is that transformations very often depend on other data in the loading database; for example, using mapping tables in a SQL JOIN is much faster than doing a row-by-row lookup in an external script, even with caching. Admittedly, that's just my experience, and procedural processing might be better for your data set.
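To show the shape of the two approaches being compared, here is a small sketch using Python's sqlite3 module. The tables (load_samples, map_site) and their contents are invented, and this is not a benchmark; it only contrasts a single set-based JOIN with a cached row-by-row lookup that produces the same result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE load_samples (sample_id INTEGER, site_code TEXT);
    CREATE TABLE map_site (site_code TEXT PRIMARY KEY, region TEXT);
    INSERT INTO load_samples VALUES (1, 'A1'), (2, 'B7'), (3, 'A1');
    INSERT INTO map_site VALUES ('A1', 'North'), ('B7', 'South');
    """
)


def transform_set_based(conn: sqlite3.Connection) -> list:
    """One statement: the database resolves every mapping in a single JOIN."""
    # Note: an inner join silently drops rows with no mapping entry.
    return conn.execute(
        """
        SELECT s.sample_id, m.region
        FROM load_samples AS s
        JOIN map_site AS m ON m.site_code = s.site_code
        ORDER BY s.sample_id
        """
    ).fetchall()


def transform_row_by_row(conn: sqlite3.Connection) -> list:
    """External-script style: one lookup query per distinct code, with a cache."""
    cache = {}
    out = []
    samples = conn.execute(
        "SELECT sample_id, site_code FROM load_samples ORDER BY sample_id"
    ).fetchall()
    for sample_id, site_code in samples:
        if site_code not in cache:
            row = conn.execute(
                "SELECT region FROM map_site WHERE site_code = ?", (site_code,)
            ).fetchone()
            cache[site_code] = row[0] if row else None
        out.append((sample_id, cache[site_code]))
    return out


set_based = transform_set_based(conn)
row_by_row = transform_row_by_row(conn)
print(set_based == row_by_row)
```

The row-by-row version also has to move every row out of the database and back, which is part of the cost the set-based form avoids.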

In a few cases we do have to do some complex parsing (of DNA sequences), and TSQL is just not a viable solution. That's where we use external .NET or Python code to do the work. I suppose we could do it all in .NET procedures/functions and keep it in the database, but there are other external connections required, so a separate program makes sense.
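As an illustration of the kind of work that is awkward in TSQL, here is a hypothetical external parser in Python. It is not our actual code: the FASTA-style input, the GC-fraction calculation and the stage_sequence table are assumptions chosen only to show parsing in a script and writing the results into a staging table.

```python
import sqlite3
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C; 0.0 for an empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


raw = ">seq1 sample A\nACGT\nGGCC\n>seq2\nATAT\n"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stage_sequence (seq_id TEXT, length INTEGER, gc_fraction REAL)"
)
conn.executemany(
    "INSERT INTO stage_sequence VALUES (?, ?, ?)",
    [(n, len(s), gc_fraction(s)) for n, s in parse_fasta(raw)],
)
conn.commit()
rows = conn.execute(
    "SELECT seq_id, length, gc_fraction FROM stage_sequence ORDER BY seq_id"
).fetchall()
print(rows)
```

Once the parsed values are in a staging table, the rest of the load can go back to set-based SQL.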

Step 3 is a series of INSERT ... SELECT ... statements: it's fast.

So all in all, use the best tool for the job, and don't worry about mixing things up. An SSIS package - or packages - is a good way to link together stored procedures, executables and whatever else you need to run, so you can design, execute and log the whole load process in one place. If it's a huge process, you can use subpackages.
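The idea of chaining steps and logging them in one place can be sketched outside SSIS too. The following is not SSIS and does not use any SSIS API; it is a plain Python stand-in with hypothetical step names, showing ordered execution, per-step logging and stopping at the first failure.

```python
import logging
import time
from typing import Callable, List, Tuple

logger = logging.getLogger("nightly_load")

Step = Tuple[str, Callable[[], None]]


def run_pipeline(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order, log each one, and stop at the first failure."""
    results: List[Tuple[str, str]] = []
    for name, step in steps:
        started = time.perf_counter()
        try:
            step()
        except Exception:
            logger.exception("step %s failed", name)
            results.append((name, "failed"))
            break
        elapsed = time.perf_counter() - started
        logger.info("step %s finished in %.3fs", name, elapsed)
        results.append((name, "ok"))
    return results


# Hypothetical steps; real ones might call a stored procedure or an executable.
def copy_to_load() -> None:
    pass


def transform_to_staging() -> None:
    raise RuntimeError("mapping table is empty")


def publish_facts() -> None:
    pass


logging.basicConfig(level=logging.INFO)
results = run_pipeline(
    [
        ("copy_to_load", copy_to_load),
        ("transform_to_staging", transform_to_staging),
        ("publish_facts", publish_facts),
    ]
)
print(results)
```

Because the failing step stops the run, the fact tables are never touched when staging is incomplete.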

I know what you mean about TSQL feeling awkward (actually, I find it more repetitive than anything else), but it is very, very fast for data-driven operations. So my feeling is: do data processing in TSQL, and do string processing or other complex operations in external code.
