Best of breed indexing data structures for Extremely Large time-series


Problem Description



I'd like to ask fellow SO'ers for their opinions regarding best of breed data structures to be used for indexing time-series (aka column-wise data, aka flat linear).

Two basic types of time-series exist, based on the sampling/discretisation characteristic:

1. Regular discretisation (every sample is taken at a common frequency)

2. Irregular discretisation (samples are taken at arbitrary time-points)

Queries that will be required:

1. All values in the time range [t0,t1]

2. All values in the time range [t0,t1] that are greater/less than v0

3. All values in the time range [t0,t1] that are in the value range [v0,v1]

The data sets consist of summarized time-series (which sort of gets over the irregular discretisation) and multivariate time-series. The data set(s) in question are about 15-20TB in size, hence processing is performed in a distributed manner, because some of the queries described above will result in datasets larger than the physical amount of memory available on any one system.

Distributed processing in this context also means dispatching the required data-specific computation along with the time-series query, so that the computation can occur as close to the data as possible, so as to reduce node-to-node communication (somewhat similar to the map/reduce paradigm); in short, proximity of computation and data is very critical.
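To illustrate the kind of data-proximate computation meant here, a minimal C++ sketch (PartialMean, local_mean, and merge_means are hypothetical names, not from any particular framework): each node reduces its local shard to a small, mergeable partial result, and only those partials cross the network.

    // Hypothetical sketch of a mergeable aggregate: each data node reduces its
    // local shard, and only the tiny partial results travel between nodes.
    #include <cstdint>
    #include <vector>

    struct PartialMean {
        double sum = 0.0;
        std::uint64_t count = 0;
    };

    // Runs on each data node, next to its shard of the series.
    PartialMean local_mean(const std::vector<double>& shard) {
        PartialMean p;
        for (double v : shard) { p.sum += v; ++p.count; }
        return p;
    }

    // Runs on the coordinator; merging is associative, so the traffic is
    // proportional to the number of nodes, not the number of samples.
    double merge_means(const std::vector<PartialMean>& partials) {
        double sum = 0.0;
        std::uint64_t count = 0;
        for (const auto& p : partials) { sum += p.sum; count += p.count; }
        return count ? sum / static_cast<double>(count) : 0.0;
    }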

Another issue that the index should be able to cope with is that the overwhelming majority of the data is static/historic (99.999...%); however, new data is added on a daily basis, think of "in the field sensors" or "market data". The idea/requirement is to be able to update any running calculations (averages, GARCHs, etc.) with as low a latency as possible; some of these running calculations require historical data, some of which will be more than what can reasonably be cached.
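The latency requirement is easiest to picture with an incremental update: a new tick adjusts the statistic in O(1) without rescanning history. A minimal sketch for a running mean, under the assumption that fancier statistics (GARCH and friends) would likewise keep a constant-size state:

    // Minimal sketch: O(1) incremental running mean, so a newly arrived sample
    // never forces a pass over the ~99.999% static/historical data.
    #include <cstdint>

    class RunningMean {
    public:
        void add(double v) {
            ++n_;
            mean_ += (v - mean_) / static_cast<double>(n_);  // incremental update
        }
        double value() const { return mean_; }
    private:
        double mean_ = 0.0;
        std::uint64_t n_ = 0;
    };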

I've already considered HDF5; it works well/efficiently for smaller datasets, but it starts to drag as the datasets become larger, and there are no native parallel-processing capabilities from the front-end.

Looking for suggestions, links, further reading etc. (C or C++ solutions, libraries)

Solution

You would probably want to use some type of large, balanced tree. As Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about getting fast insertions and updates, there is a lot of new work being done at places like MIT and CMU on these new "cache-oblivious B-trees". For some discussion of the implementation of these things, look up Tokutek DB; they've got a number of good presentations, like the following:

http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf
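To make the first query concrete, here is a minimal C++ sketch using std::map (an in-memory balanced tree) as a stand-in for a disk-backed B-tree; the Timestamp alias and values_in_range are illustrative assumptions, not part of any library. A tree ordered on time answers the [t0,t1] slice in O(log n + k) for k results:

    #include <cstdint>
    #include <map>
    #include <vector>

    using Timestamp = std::int64_t;  // e.g. nanoseconds since epoch (assumption)

    // Query 1: all values in the time range [t0, t1].
    std::vector<double> values_in_range(const std::map<Timestamp, double>& series,
                                        Timestamp t0, Timestamp t1) {
        std::vector<double> out;
        auto it = series.lower_bound(t0);   // first key >= t0
        auto end = series.upper_bound(t1);  // first key >  t1
        for (; it != end; ++it) out.push_back(it->second);
        return out;
    }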

Questions 2 and 3 are in general a lot harder, since they involve higher-dimensional range searching. The standard data structure for doing this would be the range tree (which gives O(log^{d-1}(n)) query time, at the cost of O(n log^d(n)) storage). You generally would not want to use a k-d tree for something like this. While it is true that k-d trees have optimal, O(n), storage costs, you can't evaluate range queries any faster than O(n^{(d-1)/d}) if you only use O(n) storage. For d=2, this would be O(sqrt(n)) time complexity; and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for O(10^5) disk reads to complete on a simple range query?)

Fortunately, it sounds like in your situation you really don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value per time coordinate. Hypothetically, what you could do is just use a range query to pull some interval of points, then, as a post-process, go through and apply the v constraints pointwise. This would be the first thing I would try (after getting a good database implementation), and if it works then you are done! It really only makes sense to try optimizing the latter two queries if you keep running into situations where the number of points in [t0,t1] x [-infty,+infty] is orders of magnitude larger than the number of points in [t0,t1] x [v0,v1].
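As a sketch of that post-processing idea (reusing the hypothetical values_in_range from the earlier example): pull the time slice with the one-dimensional index, then drop points outside [v0,v1] pointwise, which covers queries 2 and 3.

    #include <algorithm>
    #include <vector>

    // Queries 2 and 3: filter the pulled time slice by value; query 2 is the
    // same operation with only one of the two bounds applied.
    std::vector<double> filter_by_value(std::vector<double> slice,
                                        double v0, double v1) {
        slice.erase(std::remove_if(slice.begin(), slice.end(),
                                   [&](double v) { return v < v0 || v > v1; }),
                    slice.end());
        return slice;
    }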

