Database suggestions for time series of events

Question

For one of my projects, I have to enter a big-ish collection of events into a database for later processing and I am trying to decide which DBMS would be best for my purpose.

I have:


  • About 400,000,000 discrete events at the moment

  • About 600 GB of data that will be stored in the DB

These events come in a variety of formats, but I estimate the count of individual attributes to be about 5,000. Most events only contain values for about 100 attributes each. The attribute values are to be treated as arbitrary strings and, in some cases, integers.

The events will eventually be consolidated into a single time series. While they do have some internal structure, there are no references to other events, which - I believe - means that I don't need an object DB or some ORM system.

My requirements:


  • Open source license - I may have to tweak it a bit.

  • Scalability by being able to expand to multiple servers, although only one system will be used at first.

  • Mature drivers/bindings for C/C++, Java and Python. Preferably with a license that plays well with others - I'd rather not commit myself to anything because of a technical decision. I think that most DB drivers do not have a problem here, but it should be mentioned anyway.

  • Availability for Linux.

  • It would be nice, but not necessary, if it was also available for Windows.

My ideal DB for this would allow me to retrieve all the events from a specified time period with a single query.

What I have found so far:


  • PostgreSQL with an increased page size can apparently have up to 6,000 columns in each table. If my estimate of the attribute count is not off, it might do.

  • MySQL seems to have a limit of 4,000 columns per table. I could use multiple tables with a bit of SQL-fu, but I'd rather not.

  • MongoDB (http://www.mongodb.org/) is what I am currently leaning towards. It would allow me to preserve the internal structure of the events, while still being able to query them. Its API also seems quite straightforward. I have no idea how well it does performance-wise though - at least on a single server.

  • OpenTSDB and its metric collection framework sound interesting. I could use a single time series for each attribute (which might help with some of my processing), have the attribute value as a tag and additionally tag the entries to associate them with a specific event. It probably has a steeper learning curve than the three above, both from an administrator's and an application programmer's point of view. No idea about its performance.

  • Use HBase directly. This might fit my requirements better than OpenTSDB, although - judging from my past experience with Hadoop - the administration overhead is probably still higher than for the first three options.

There are probably other databases that could do it, so feel free to let me know - I would appreciate any suggestion or comment that might help me with this.

PS: I only have minimal experience as a DB administrator, so I apologise for any misconceptions.

Recommended answer

Using tables with thousands of columns is madness, especially when most of them are empty, as you said.

You should first look into converting your data-structure from this:

table_1
-------
event_id
attribute_1
attribute_2
[...]
attribute_5000

into this:

table_1          event_values             attributes
--------         ------------             ----------
event_id         event_id                 attribute_id
                 attribute_id             attribute_type
                 attribute_value

which can be used with any RDBMS (your only constraints would then be the total database size and performance).
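To make the layout above concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in so the example runs without a server; it is not one of the systems discussed in the question. The `event_time` and `attribute_name` columns, the index, and the helper functions `add_event` and `events_between` are assumptions added for illustration and are not part of the schema shown above.

```python
import sqlite3

# In-memory SQLite database, used only to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attributes (
    attribute_id   INTEGER PRIMARY KEY,
    attribute_name TEXT NOT NULL UNIQUE,  -- assumed column, not in the answer
    attribute_type TEXT NOT NULL
);
CREATE TABLE table_1 (
    event_id   INTEGER PRIMARY KEY,
    event_time INTEGER NOT NULL           -- assumed column: Unix timestamp
);
CREATE INDEX idx_event_time ON table_1 (event_time);
CREATE TABLE event_values (
    event_id        INTEGER NOT NULL REFERENCES table_1 (event_id),
    attribute_id    INTEGER NOT NULL REFERENCES attributes (attribute_id),
    attribute_value TEXT,
    PRIMARY KEY (event_id, attribute_id)
);
""")


def add_event(conn, event_time, values):
    """Insert one event; `values` maps attribute name -> value (hypothetical helper)."""
    cur = conn.execute("INSERT INTO table_1 (event_time) VALUES (?)", (event_time,))
    event_id = cur.lastrowid
    for name, value in values.items():
        attr_type = "integer" if isinstance(value, int) else "string"
        # Register the attribute the first time it is seen.
        conn.execute(
            "INSERT OR IGNORE INTO attributes (attribute_name, attribute_type) "
            "VALUES (?, ?)",
            (name, attr_type),
        )
        (attribute_id,) = conn.execute(
            "SELECT attribute_id FROM attributes WHERE attribute_name = ?", (name,)
        ).fetchone()
        # Only attributes that are actually present get a row; values are stored as text.
        conn.execute(
            "INSERT INTO event_values VALUES (?, ?, ?)",
            (event_id, attribute_id, str(value)),
        )
    return event_id


def events_between(conn, start, end):
    """Return {event_id: (event_time, {name: value})} for start <= event_time < end."""
    rows = conn.execute(
        """
        SELECT e.event_id, e.event_time, a.attribute_name, v.attribute_value
        FROM table_1 e
        JOIN event_values v ON v.event_id = e.event_id
        JOIN attributes a ON a.attribute_id = v.attribute_id
        WHERE e.event_time >= ? AND e.event_time < ?
        ORDER BY e.event_time, e.event_id
        """,
        (start, end),
    )
    events = {}
    for event_id, event_time, name, value in rows:
        events.setdefault(event_id, (event_time, {}))[1][name] = value
    return events


add_event(conn, 1000, {"host": "alpha", "status": 200})
add_event(conn, 2000, {"user": "bob"})
add_event(conn, 3000, {"host": "beta"})
result = events_between(conn, 1000, 3000)
```

Each event only occupies as many rows in `event_values` as it has attributes, so sparse events cost nothing for the attributes they lack, and a single query with a range condition on `event_time` retrieves everything in a time period. The trade-off is row count: with roughly 100 attributes per event, 400,000,000 events would mean on the order of 40 billion rows in `event_values`, which is where the size and performance constraint mentioned above comes in.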
