Database suggestions for time series of events


Question

For one of my projects, I have to enter a big-ish collection of events into a database for later processing and I am trying to decide which DBMS would be best for my purpose.

I have:

  • About 400,000,000 discrete events at the moment

  • About 600 GB of data that will be stored in the DB

These events come in a variety of formats, but I estimate the count of individual attributes to be about 5,000. Most events only contain values for about 100 attributes each. The attribute values are to be treated as arbitrary strings and, in some cases, integers.

The events will eventually be consolidated into a single time series. While they do have some internal structure, there are no references to other events, which - I believe - means that I don't need an object DB or some ORM system.

My requirements:

  • Open source license - I may have to tweak it a bit.

  • Scalability by being able to expand to multiple servers, although only one system will be used at first.

  • Fast queries - updates are not that critical.

  • Mature drivers/bindings for C/C++, Java and Python. Preferably with a license that plays well with others - I'd rather not commit myself to anything because of a technical decision. I think that most DB drivers do not have a problem here, but it should be mentioned anyway.

  • Available for Linux.

  • It would be nice, but not necessary, if it was also available for Windows.

My ideal DB for this would allow me to retrieve all the events from a specified time period with a single query.

What I have found/considered so far:

  • PostgreSQL with an increased page size can apparently have up to 6,000 columns in each table. If my estimate of the attribute count is not off, it might do.

  • MySQL seems to have a limit of 4,000 columns per table. I could use multiple tables with a bit of SQL-fu, but I'd rather not.

  • MongoDB is what I am currently leaning towards. It would allow me to preserve the internal structure of the events, while still being able to query them. Its API also seems quite straightforward. I have no idea how well it does performance-wise though - at least on a single server.

  • OpenTSDB and its metric collection framework sound interesting. I could use a single time series for each attribute (which might help with some of my processing), have the attribute value as a tag and additionally tag the entries to associate them with a specific event. It probably has a steeper preparation curve than the three above, both from an administrator and an application programmer point of view. No idea about its performance.

  • Use HBase directly. This might fit my requirements better than OpenTSDB, although - judging from my past experience with Hadoop - the administration overhead is probably still higher than for the first three options.

There are probably other databases that could do it, so feel free to let me know - I would appreciate any suggestion or comment that might help me with this.

PS: I only have minimal experience as a DB administrator, so I apologise for any misconceptions.

Recommended answer

Using tables with thousands of columns is madness. Especially when most of them are empty, as you said.

You should first look into converting your data-structure from this:

table_1
-------
event_id
attribute_1
attribute_2
[...]
attribute_5000

into this:

table_1          event_values             attributes
--------         ------------             ----------
event_id         event_id                 attribute_id
                 attribute_id             attribute_type
                 attribute_value

which can be used with any RDBMS (your only constraints would then be the total database size and performance).
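
As a rough sketch of that layout, here is one way it could look using SQLite through Python's built-in sqlite3 module. SQLite is chosen purely for illustration, not as a recommendation for 600 GB of data. The `events` table stands in for table_1, and the `event_time` and `attribute_name` columns, the index, and the helper functions `open_db`, `add_event` and `events_between` are assumptions added for the example; they are not part of the layout above. The last function shows how all events from a time period might be fetched with a single query.

```python
import sqlite3

# Three-table layout: one row per event, one row per known attribute,
# and one row per (event, attribute) pair that actually has a value.
SCHEMA = """
CREATE TABLE events (
    event_id   INTEGER PRIMARY KEY,
    event_time INTEGER NOT NULL            -- assumed column: Unix timestamp
);
CREATE TABLE attributes (
    attribute_id   INTEGER PRIMARY KEY,
    attribute_name TEXT NOT NULL UNIQUE,   -- assumed column
    attribute_type TEXT NOT NULL           -- 'string' or 'integer'
);
CREATE TABLE event_values (
    event_id        INTEGER NOT NULL REFERENCES events(event_id),
    attribute_id    INTEGER NOT NULL REFERENCES attributes(attribute_id),
    attribute_value TEXT,
    PRIMARY KEY (event_id, attribute_id)
);
CREATE INDEX idx_events_time ON events(event_time);
"""


def open_db(path=":memory:"):
    """Open a database and create the three tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_event(conn, event_time, values):
    """Insert one event; `values` maps attribute name -> string or int."""
    cur = conn.execute("INSERT INTO events (event_time) VALUES (?)", (event_time,))
    event_id = cur.lastrowid
    for name, value in values.items():
        attr_type = "integer" if isinstance(value, int) else "string"
        # Register the attribute the first time it is seen.
        conn.execute(
            "INSERT OR IGNORE INTO attributes (attribute_name, attribute_type) "
            "VALUES (?, ?)",
            (name, attr_type),
        )
        (attribute_id,) = conn.execute(
            "SELECT attribute_id FROM attributes WHERE attribute_name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO event_values (event_id, attribute_id, attribute_value) "
            "VALUES (?, ?, ?)",
            (event_id, attribute_id, str(value)),
        )
    return event_id


def events_between(conn, start, end):
    """Fetch all events with start <= event_time < end using a single query."""
    rows = conn.execute(
        """
        SELECT e.event_id, e.event_time, a.attribute_name, v.attribute_value
        FROM events e
        JOIN event_values v ON v.event_id = e.event_id
        JOIN attributes a ON a.attribute_id = v.attribute_id
        WHERE e.event_time >= ? AND e.event_time < ?
        ORDER BY e.event_time, e.event_id
        """,
        (start, end),
    )
    result = {}
    for event_id, event_time, name, value in rows:
        entry = result.setdefault(event_id, {"time": event_time, "values": {}})
        entry["values"][name] = value
    return result


if __name__ == "__main__":
    conn = open_db()
    add_event(conn, 100, {"host": "a", "code": 404})
    add_event(conn, 200, {"host": "b"})
    add_event(conn, 300, {"user": "x"})
    for event_id, event in events_between(conn, 100, 300).items():
        print(event_id, event["time"], event["values"])
```

Only attributes that an event actually carries produce a row in `event_values`, so the roughly 4,900 unused attributes per event cost nothing. Values are stored as text here, with `attribute_type` recording which ones should be read back as integers.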
