Database solution for static time-series data


Problem Description



We have a large and growing dataset of experimental data taken from around 30,000 subjects. For each subject, there are several recordings of data. Within each recording, there is a collection of several time series of physiological data, each about 90 seconds long and sampled at 250 Hz. I should note that any given instance of a time series is never extended; only additional recordings are added to the dataset. These recordings are not all of the same length, either. Currently, the data for each recording is contained in its own flat file. These files are organized in a directory structure that is broken down hierarchically by version of the overall experiment, experiment location, date, and experiment terminal (in that order).

Most of our analysis is done in MATLAB and we plan to continue to use MATLAB extensively for further analysis. The situation as it stands was workable (if undesirable) when all researchers were co-located. We are now spread around the globe and I am investigating the best solution to make all of this data available from remote locations. I am well-versed in MySQL and SQL Server, and could easily come up with a way to structure this data within such a paradigm. I am, however, skeptical as to the efficiency of this approach. I would value any suggestions that might point me in the right direction. Should I be considering something different? Time series databases (though those seem to me to be tuned for extending existing time series)? Something else?

Analysis does not need to be done online, though the possibility of doing so would be a plus. For now, our typical use case would be to query for a specific subset of recordings and pull down the associated time series for local analysis. I appreciate any advice you might have!

Update:

In my research, I've found this paper, where they are storing and analyzing very similar signals. They've chosen MongoDB for the following reasons:

  • Speed of development
  • The ease of adding fields to existing documents (features extracted from signals, etc.)
  • Ease of MapReduce use through the MongoDB API itself

These are all attractive advantages to me as well. The development looks dead simple, and the ability to easily augment existing documents with the results of analysis is clearly helpful (though I know this isn't exactly difficult to do in the systems with which I am already familiar).

To be clear, I know that I can leave the data stored in flat files, and I know I could simply arrange for secure access to these flat files via MATLAB over the network. There are numerous reasons I want to store this data in a database. For instance:

  • There is little structure to the flat files beyond the directory hierarchy stated above. It is impossible, for instance, to pull all data from a particular day without pulling down the individual files for every terminal from that day.
  • There is no way to query against metadata associated with a particular recording. I shudder to think of the hoops I'd need to jump through to pull all data for female subjects, for example.

The long and short of it is that I want to store these data in a database for myriad reasons (space, efficiency, and ease-of-access considerations, among many others).

Update 2

I seem not to be sufficiently describing the nature of these data, so I will attempt to clarify. These recordings are certainly time series data, but not in the way many people think of time series. I am not continually capturing data to be appended to an existing time series. I am really making multiple recordings, all with varying metadata, but of the same three signals. These signals can be thought of as vectors of numbers, and the length of these vectors varies from recording to recording. In a traditional RDBMS, I might create one table for recording type A, one for B, etc., and treat each row as a data point in the time series. However, this does not work, because recordings vary in length. Rather, I would prefer to have an entity that represents a person, and have that entity associated with the several recordings taken from that person. This is why I have considered MongoDB: I can nest several arrays (of varying lengths) within one object in a collection.

Potential MongoDB Structure

As an example, here's what I sketched as a potential MongoDB BSON structure for a subject:

{
    "songs": 
    {
        "order": 
        [
            "R008",
            "R017",
            "T015"
        ],
        "times": [
            { 
                "start": "2012-07-02T17:38:56.000Z",
                "finish": "2012-07-02T17:40:56.000Z",
                "duration": 119188.445
            },
            { 
                "start": "2012-07-02T17:42:22.000Z",
                "finish": "2012-07-02T17:43:41.000Z",
                "duration": 79593.648
            },
            { 
                "start": "2012-07-02T17:44:37.000Z",
                "finish": "2012-07-02T17:46:19.000Z",
                "duration": 102450.695
            }
        ] 
    },
    "self_report":
    {
        "music_styles":
        {
                "none": false,
                "world": true
        },
        "songs":
        [
            {
                "engagement": 4,
                "positivity": 4,
                "activity": 3,
                "power": 4,
                "chills": 4,
                "like": 4,
                "familiarity": 4
            },
            {
                "engagement": 4,
                "positivity": 4,
                "activity": 3,
                "power": 4,
                "chills": 4,
                "like": 4,
                "familiarity": 3
            },
            {
                "engagement": 2,
                "positivity": 1,
                "activity": 2,
                "power": 2,
                "chills": 4,
                "like": 1,
                "familiarity": 1
            }
        ],
        "most_engaged": 1,
        "most_enjoyed": 1,
        "emotion_indices":
        [
            0.729994,
            0.471576,
            28.9082
        ]
    },
    "signals":
    {
        "test":
        {
            "timestamps":
            [
                0.010, 0.010, 0.021, ...
            ],
            "eda":
            [
                149.200, 149.200, 149.200, ...
            ],
            "pox":
            [
                86.957, 86.957, 86.957, ...
            ]
        },
        "songs":
        [
            {
                "timestamps":
                [
                    0.010, 0.010, 0.021, ...
                ],
                "eda":
                [
                    149.200, 149.200, 149.200, ...
                ],
                "pox":
                [
                    86.957, 86.957, 86.957, ...
                ]
            },
            {
                "timestamps":
                [
                    0.010, 0.010, 0.021, ...
                ],
                "eda":
                [
                    149.200, 149.200, 149.200, ...
                ],
                "pox":
                [
                    86.957, 86.957, 86.957, ...
                ]
            },
            {
                "timestamps":
                [
                    0.010, 0.010, 0.021, ...
                ],
                "eda":
                [
                    149.200, 149.200, 149.200, ...
                ],
                "pox":
                [
                    86.957, 86.957, 86.957, ...
                ]
            }
        ]
    },
    "demographics":
    {
        "gender": "female",
        "dob": 1980,
        "nationality": "rest of the world",
        "musical_background": false,
        "musical_expertise": 1,
        "impairments":
        {
            "hearing": false,
            "visual": false
        }
    },
    "timestamps":
    {
        "start": "2012-07-02T17:37:47.000Z",
        "test": "2012-07-02T17:38:16.000Z",
        "end": "2012-07-02T17:46:56.000Z"
    }
}

Those signals are the time series.
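
For what it's worth, the "pull all data for female subjects" example from my earlier list reduces to a single ad hoc query against a structure like this. A minimal sketch from the mongo shell, where the collection name subjects is just a placeholder of mine:

    // Every subject document whose self-reported gender is female
    db.subjects.find({ "demographics.gender": "female" })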

Solution

Quite often when people come to NoSQL databases, they come having heard that there's no schema and life's all good. IMHO, however, this is a really mistaken notion.

When dealing with NoSQL, you have to think in terms of "aggregates". Typically, an aggregate is an entity that can be operated on as a single unit. In your case, one possible (but not especially efficient) way would be to model a user and his/her data as a single aggregate. This would ensure that your user aggregate is data centre / shard agnostic. But if the data is going to grow, loading a user will also load all the related data and become a memory hog (Mongo as such is a bit greedy with memory).
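
If you do go the single-aggregate route, one thing that softens the memory cost is a projection that excludes the heavy arrays whenever you only need metadata. A minimal sketch, reusing the hypothetical subjects collection named above:

    // Metadata-only fetch: the exclusion projection keeps the large
    // signal arrays from ever leaving the server
    db.subjects.find(
        { "demographics.gender": "female" },
        { "signals": 0 }
    )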

Another option is to store each recording as its own aggregate, "linked" back to the user with an id - this can be a synthetic key that you create, like a GUID. Even though this superficially looks like a join, it is just a "look up by property", since there is no real referential integrity here. This is probably the approach I would take if files are going to be added constantly.
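
A rough sketch of that second option, one document per recording; the collection and field names here (recordings, subject_id, session_start) are my own placeholders, not anything prescribed:

    // One recording per document, "linked" to its subject by a synthetic key
    {
        "subject_id": "8f14e45f-...",                      // GUID-style key of the subject document
        "session_start": ISODate("2012-07-02T17:38:56Z"),
        "timestamps": [ 0.010, 0.010, 0.021, ... ],
        "eda": [ 149.200, 149.200, 149.200, ... ],
        "pox": [ 86.957, 86.957, 86.957, ... ]
    }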

The place where MongoDB shines is that you can run ad hoc queries by any property in the document (create an index for that property if you don't want to lose hair later down the road). You will not go wrong choosing Mongo for time-series data storage. You can, for example, extract data that matches an id within a date range without doing any major stunts.
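
For example, with recordings stored as their own documents as sketched above, the index plus an id-and-date-range pull would look roughly like this (again with my placeholder names):

    // Index the lookup properties first - queries on unindexed
    // properties will scan the entire collection
    db.recordings.createIndex({ "subject_id": 1, "session_start": 1 })

    // All recordings for one subject within a date range
    db.recordings.find({
        "subject_id": "8f14e45f-...",
        "session_start": {
            "$gte": ISODate("2012-07-01T00:00:00Z"),
            "$lt": ISODate("2012-07-08T00:00:00Z")
        }
    })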

Please do ensure that you have replica sets no matter which approach you take, and diligently choose your sharding approach early on - sharding later is no fun.
