Good (== fast) storage strategy for facts with dynamically evolving dimensions?

Problem description

I need to store large amounts of metering data in a database. A record consists of an id that identifies the data's source, a timestamp and a value. The records are later retrieved via the id and their timestamp.

According to my previous experience (I am developing the successor of an application that's been in productive use over the last five years), disk i/o is the relevant performance bottleneck for data retrieval. (See also this other question of mine).

As I am never looking for single rows but always for (possibly large) groups of rows that match a range of ids and timestamps, a pretty obvious optimization seems to be to store larger, compressed chunks of data that are accessed by a much smaller index (e.g. a day number) and are decompressed and filtered on the fly by the application.
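
To make that idea concrete, here is a minimal sketch of one way it could work, assuming a (source_id, day) chunk key, packed (timestamp, value) pairs, and sqlite/zlib purely for illustration; none of these names or choices are from the question itself:

```python
import sqlite3
import struct
import zlib

# Minimal sketch (assumed naming): one zlib-compressed chunk per
# (source_id, day); the chunks table itself is the "much smaller index".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks ("
    " source_id INTEGER, day INTEGER, data BLOB,"
    " PRIMARY KEY (source_id, day))"
)

def store_chunk(source_id, day, records):
    """records: iterable of (unix_ts, value) pairs for one source and one day."""
    raw = b"".join(struct.pack("<qd", ts, value) for ts, value in records)
    conn.execute(
        "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?)",
        (source_id, day, zlib.compress(raw)),
    )

def load_range(source_id, day, ts_from, ts_to):
    """Fetch one chunk, decompress it, and filter the rows on the fly."""
    row = conn.execute(
        "SELECT data FROM chunks WHERE source_id = ? AND day = ?",
        (source_id, day),
    ).fetchone()
    if row is None:
        return []
    raw = zlib.decompress(row[0])
    pairs = (struct.unpack_from("<qd", raw, off) for off in range(0, len(raw), 16))
    return [(ts, value) for ts, value in pairs if ts_from <= ts <= ts_to]

# One day of minute readings for source 1, then a one-hour slice of it.
store_chunk(1, 14245, [(14245 * 86400 + i * 60, float(i)) for i in range(1440)])
print(len(load_range(1, 14245, 14245 * 86400 + 3600, 14245 * 86400 + 7200)))  # -> 61
```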

What I'm looking for is the best strategy for deciding what portion of the data to put in one chunk. In a perfect world, each user request would be fulfilled by retrieving one chunk of data and using most or all of it. So I want to minimize the number of chunks I have to load for each request, and I want to minimize the excess data per chunk.
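
The trade-off can be sketched with a toy I/O cost model (the numbers and the seek/transfer split below are assumptions for illustration only): fewer, larger chunks pay for fewer seeks but transfer and decompress more excess data, while many small chunks become seek-bound on spinning disks.

```python
# Back-of-the-envelope cost model: each chunk costs one seek plus its
# transfer time; oversized chunks pay for excess bytes the request discards.
def request_cost_ms(n_chunks, chunk_bytes, seek_ms=8.0, mb_per_s=100.0):
    transfer_ms = n_chunks * chunk_bytes / (mb_per_s * 1000.0)
    return n_chunks * seek_ms + transfer_ms

# Few big chunks (half the bytes are excess) vs. many small exact ones,
# for the same ~8 MB of useful data:
print(request_cost_ms(n_chunks=2,   chunk_bytes=8_000_000))  # ~176 ms
print(request_cost_ms(n_chunks=200, chunk_bytes=40_000))     # ~1680 ms, seek-bound
```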

I'll post an answer below containing my ideas so far, and make it community property so you can expand on it. Of course, if you have a different approach, post your own.

ETA: S. Lott has posted this answer below, which is helpful to the discussion even if I can't use it directly (see my comments). The point here is that the "dimensions" to my "facts" are (and should be) influenced by the end user and change over time. This is a core feature of the app and actually the reason I wound up with this question in the first place.

Recommended answer

"groups of rows that match a range of ids and timestamps"

You have two dimensions: the source and time. I'm sure the data source has lots of attributes. Time, I know, has a lot of attributes (year, month, day, hour, day of week, week of year, quarter, fiscal period, etc., etc.)

While your facts have "just" an ID and a timestamp, they could have FKs to the data source dimension and the time dimension.

Viewed as a star-schema, a query that locates "groups of rows that match a range of ids" may -- more properly -- be a group of rows with a common data source attribute. It isn't so much a random cluster of ID's, it's a cluster of ID's defined by some common attribute of your dimensions.
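
As a concrete sketch of such a star schema and of that kind of query — with every table, column, and attribute name being an assumption of mine, not part of the answer:

```python
import sqlite3

# Illustrative star schema: a fact table with FKs into a data-source
# dimension and a time dimension (all names are assumed for the example).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_source (
    source_id   INTEGER PRIMARY KEY,
    source_kind TEXT,   -- e.g. the kind of meter
    site        TEXT
);
CREATE TABLE dim_time (
    time_id INTEGER PRIMARY KEY,
    year    INTEGER,
    month   INTEGER,
    day     INTEGER,
    hour    INTEGER,
    weekday INTEGER
);
CREATE TABLE fact_measurement (
    source_id INTEGER REFERENCES dim_source(source_id),
    time_id   INTEGER REFERENCES dim_time(time_id),
    value     REAL
);
""")

# "A range of ids" becomes a filter on a shared dimension attribute:
rows = conn.execute("""
    SELECT f.value
    FROM fact_measurement f
    JOIN dim_source s ON s.source_id = f.source_id
    JOIN dim_time   t ON t.time_id   = f.time_id
    WHERE s.source_kind = 'electricity_meter'
      AND t.year = 2009 AND t.month = 6
""").fetchall()
```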

Once you define these attributes of the data source dimension, your "chunking" strategy should be considerably more obvious.

Further, you may find that the bit-mapped index capability of some database products makes it possible to simply store your facts in a plain-old table without sweating the chunk design at all.
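
For reference, such an index on the fact table's dimension FK would look like this in Oracle (a sketch only: sqlite and many other engines do not support bitmap indexes, and the names continue the illustration above):

```python
# Oracle-style bitmap index on a low-cardinality fact column; bitmap indexes
# let the optimizer AND/OR compressed bitmaps instead of scanning rows.
# This DDL is illustrative and must run against an engine that supports it.
BITMAP_INDEX_DDL = """
CREATE BITMAP INDEX fact_source_bix
    ON fact_measurement (source_id)
"""
```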

If bit-mapped indexes still aren't fast enough, then perhaps you have to denormalize the data source attributes into both dimension and fact, and then partition the fact table on this dimensional attribute.
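
As one example of that denormalize-and-partition variant, PostgreSQL's declarative partitioning (version 10+) could express it like this — again a sketch with assumed names:

```python
# Sketch: denormalize the low-cardinality source attribute into the fact
# table and partition on it (PostgreSQL 10+ syntax; run via any PG client).
PARTITIONED_FACT_DDL = """
CREATE TABLE fact_measurement (
    source_id   integer,
    time_id     integer,
    source_kind text,            -- denormalized from dim_source
    value       double precision
) PARTITION BY LIST (source_kind);

CREATE TABLE fact_electricity PARTITION OF fact_measurement
    FOR VALUES IN ('electricity_meter');
"""
```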
