PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data


Problem description

Background:

I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.

I need to extract data from it on a semi real-time basis (someone is bound to ask what semi real-time means, and the answer is: as frequently as I reasonably can, but I will be pragmatic; as a benchmark let's say we are hoping for every 15 min) and feed it into a data-warehouse.

How much data? At peak times we are talking approx 80-100k rows per min hitting the OLTP side; off-peak this will drop significantly to 15-20k. The most frequently updated rows are ~64 bytes each, but there are various tables etc., so the data is quite diverse and can range up to 4000 bytes per row. The OLTP is active 24x5.5.

Best solution?

From what I can piece together, the most practical solution is as follows:


  • Create a TRIGGER that writes all DML activity to a rotating CSV log file (a minimal sketch of such a trigger follows this list)

  • Perform whatever transformations are required

  • Use the native DW data-pump tool to efficiently pump the transformed CSV into the DW
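
A minimal sketch of what such a trigger could look like, assuming the DML is first captured in a staging table and dumped to CSV on a schedule (plain PL/pgSQL cannot write files directly); the table, trigger and path names below are illustrative, not from the original setup:

    -- Hypothetical staging table for captured changes
    CREATE TABLE etl_change_log (
        change_id  bigserial PRIMARY KEY,
        table_name text        NOT NULL,
        operation  char(1)     NOT NULL,             -- 'I', 'U' or 'D'
        changed_at timestamptz NOT NULL DEFAULT now(),
        row_data   text                              -- flattened row image (NEW::text);
                                                     -- serialize per column instead if
                                                     -- your version rejects the cast
    );

    CREATE OR REPLACE FUNCTION log_dml() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO etl_change_log (table_name, operation, row_data)
            VALUES (TG_TABLE_NAME, 'D', OLD::text);
            RETURN OLD;
        ELSE
            INSERT INTO etl_change_log (table_name, operation, row_data)
            VALUES (TG_TABLE_NAME, substr(TG_OP, 1, 1), NEW::text);
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    -- Attach only to the tables you actually want to extract
    CREATE TRIGGER orders_dml_log
        AFTER INSERT OR UPDATE OR DELETE ON orders
        FOR EACH ROW EXECUTE PROCEDURE log_dml();

    -- Every ~15 minutes: dump the staging table to CSV ...
    COPY etl_change_log TO '/var/lib/postgresql/etl/changes.csv' WITH CSV;
    -- ... then clear only what was exported, e.g.
    -- DELETE FROM etl_change_log WHERE change_id <= <max change_id just exported>;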

Why this approach?


  • TRIGGERs allow the target tables to be selected rather than being system-wide + the output is configurable (i.e. into CSV) and is relatively easy to write and deploy. SLONY uses a similar approach and the overhead is acceptable

  • CSV is easy and fast to transform

  • Easy to pump the CSV into the DW (see the sketch after this list)
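
If the warehouse also happens to accept PostgreSQL-style COPY, pumping the transformed file in is a one-liner; otherwise the same idea applies with the DW vendor's bulk loader (the schema and path here are made up for illustration):

    -- On the DW side: bulk-load the transformed CSV into a staging table
    COPY dw_staging.changes FROM '/var/lib/dw/incoming/changes.csv' WITH CSV;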

Alternatives considered ....


  • Using native logging (http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html). Problem with this is it looked very verbose relative to what I needed and was a little trickier to parse and transform. However it could be faster as I presume there is less overhead compared to a TRIGGER. Certainly it would make the admin easier as it is system wide but again, I don't need some of the tables (some are used for persistent storage of JMS messages which I do not want to log). A config sketch follows this list.
  • Querying the data directly via an ETL tool such as Talend and pumping it into the DW ... problem is the OLTP schema would need to be tweaked to support this and that has many negative side-effects
  • Using a tweaked/hacked SLONY - SLONY does a good job of logging and migrating changes to a slave so the conceptual framework is there but the proposed solution just seems easier and cleaner
  • Using the WAL
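
For comparison, the native-logging alternative in the first bullet comes down to a few postgresql.conf settings; 8.3's csvlog destination at least makes the output machine-parseable, but it is instance-wide, so the JMS tables cannot be excluded. The values below are only a sketch:

    # postgresql.conf -- sketch of the statement-logging alternative
    logging_collector = on
    log_destination   = 'csvlog'    # machine-readable CSV server log (new in 8.3)
    log_statement     = 'mod'       # log INSERT/UPDATE/DELETE/TRUNCATE only
    log_rotation_age  = 15min       # roughly matches the 15-minute extraction window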

Has anyone done this before? Want to share your thoughts?

Recommended answer

Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, then you will get much, much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the last key value from the last extraction (0 if it is the first extraction). This incremental, decoupled approach avoids introducing trigger latency into the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with the number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to NULL so it gets a new value and gets picked up by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
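
A sketch of that incremental pull as it might be run from psql, with an illustrative table and key name; the high-water mark would come from wherever the ETL job persists it:

    -- In psql the high-water mark could be passed in as a variable, e.g.
    --   \set last_max_key 123456
    COPY (
        SELECT *
        FROM   orders
        WHERE  order_key > :last_max_key
        ORDER  BY order_key
    ) TO '/var/lib/postgresql/etl/orders_delta.csv' WITH CSV;

    -- After a successful load into the DW, record max(order_key) of the
    -- exported rows as the new last_max_key for the next run.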

I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source.) The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.

Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code every time you upgrade Postgres) I would not necessarily use the WALs either -- that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly -- or even statement logging -- since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine.) Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (in terms of thrashing the filesystem cache, mostly) from shipping WALs to a different machine as from running the incremental SELECT.
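
For completeness, shipping the WAL segments to a second machine is just archive configuration on the primary in 8.3; the rsync destination below is illustrative:

    # postgresql.conf on the primary -- sketch of WAL shipping to a standby
    archive_mode    = on
    archive_command = 'rsync -a %p standby:/var/lib/postgresql/wal_archive/%f'
    archive_timeout = 900           # force a segment switch at least every 15 minutes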
