将 20[TB] 数据存储为向量/数组的 NoSql 解决方案? [英] NoSql solution to store 20[TB] of data, as vector/array?

查看:55
本文介绍了将 20[TB] 数据存储为向量/数组的 NoSql 解决方案?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要建立一个系统来有效地存储 &维护大量 (20 [TB]) 数据(并能够以向量"形式访问它).这是我的尺寸:

I need to build a system to efficiently store & maintain a huge amount (20 [TB]) of data (and be able to access it in 'vector' form). Here are my dimensions:

(1) 时间(以 YYYYMMDDHHMMSS 形式的整数给出)

(2) 字段(任意给定长度的字符串,代表医院名称)

(3) instrumentID(一个整数,表示工具的唯一ID)

我需要一种能够单独存储数据的方法,例如:

I will need a way to be able to store data individually, meaning, something like:

STORE 23789.46 作为instrumentID = 5 on field = 'Nhsdg' on time = 20040713113500

但是,我需要以下查询来运行FAST:给我时间戳Y"上字段X"的所有工具.

Yet, I would need the following query to to run FAST: give me all instruments for field 'X' on timestamp 'Y'.

为了构建这些系统,我得到了 60 台双核机器(每台机器有 1GB 内存,1.5TB 磁盘)

In order to build these systems, I am given 60 duo-core machines (each with 1GB of RAM, 1.5TB disk)

对合适的 NoSQL 解决方案有什么建议(最好与 Python 一起使用)?

Any recommendation on a suitable NoSQL soltuion (that would ideally work with python)?

注意:系统将首先存储历史数据(大约 20[TB]).每天我最多只会添加大约 200[MB].我只需要一个可以扩展的解决方案.我的用例只是一个简单的查询:在时间戳Y"上给我字段X"的所有工具

NOTE: the system will first store historical data (which is roughly 20[TB]). Every day I will add just about 200[MB] at most. I just need a solution that would scale and scale. my use case would be just a simple query: give me all instruments for field 'X' on timestamp 'Y'

推荐答案

MongoDB 可扩展并支持许多您通常会在 RDBMS 中找到的索引功能,例如 复合键索引.您可以对数据中的名称和时间属性使用复合索引.然后,您可以检索具有特定名称和日期范围的所有仪器读数.

MongoDB scales awesomely and supports many of the indexing features you'd typically find in an RDBMS such as compound key indexes. You can use a compound index on the name and time attributes in your data. Then you can retrieve all instrument readings with a particular name and date range.

[现在在简单的情况下,您只对一个基本查询完全感兴趣,而对其他任何事情都不感兴趣,您可以将名称和时间戳组合起来并调用您的键,这可以在任何键值存储中使用...]

[Now in the simple case where you're strictly interested in just that one basic query and nothing else, you can just combine the name and timestamp and call that your key, which would work in any key-value store...]

HBase 是另一个很好的选择.您可以使用 名称和日期上的复合行键.

HBase is another excellent option. You can use a composite row key on name and date.

正如其他人提到的,您绝对可以使用关系数据库.MySQL &PostgreSQL 当然可以处理负载,并且 表分区 在这种情况下可能是可取的以及因为您处理时间范围.您可以使用批量加载(并在加载期间禁用索引)来减少插入时间.

As others have mentioned, you can definitely use a relational database. MySQL & PostgreSQL can certainly handle the load and table partitioning might be desirable in this scenario as well since your dealing with time ranges. You can use bulk loading (and disabling indexes during loading) to decrease insertion time.

这篇关于将 20[TB] 数据存储为向量/数组的 NoSql 解决方案?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆