How to store 7.3 billion rows of market data (optimized to be read)?


Problem Description


I have a dataset of 1-minute data for 1000 stocks since 1998, totalling around (2012-1998)*(365*24*60)*1000 = 7.3 billion rows.
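The estimate can be sanity-checked with quick arithmetic (it is an upper bound, since markets are not actually open 24 hours a day, 365 days a year):

```python
years = 2012 - 1998            # 14 years of history
minutes_per_year = 365 * 24 * 60
stocks = 1000

rows = years * minutes_per_year * stocks
print(rows)  # 7358400000, i.e. roughly 7.3 billion
```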

Most (99.9%) of the time I will perform only read requests.

What is the best way to store this data in a db?

  • 1 big table with 7.3B rows?
  • 1000 tables (one for each stock symbol) with 7.3M rows each?
  • any recommendation for a database engine? (I'm planning to use MySQL on Amazon RDS)

I'm not used to dealing with datasets this big, so this is an excellent opportunity for me to learn. I would greatly appreciate your help and advice.

Edit:

This is a sample row:

'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0

Column 1 is the stock symbol, column 2 is the date, column 3 is the minute, and the rest are the open, high, low, and close prices, the volume, and 3 integer columns.

Most of the queries will be like "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
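As a sketch of the "one big table" option, here is what the schema and the typical range query could look like, shown with Python's built-in sqlite3 as a stand-in for MySQL. The table and column names (`bars`, `f1`-`f3`, etc.) are my own inventions, and reading the minute column as HHMM (938 = 9:38) is an assumption based on the sample row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bars (
        symbol TEXT    NOT NULL,              -- e.g. 'AAPL'
        day    INTEGER NOT NULL,              -- YYYYMMDD, e.g. 20041208
        minute INTEGER NOT NULL,              -- assumed HHMM, e.g. 938 = 9:38
        open REAL, high REAL, low REAL, close REAL,
        volume REAL,
        f1 INTEGER, f2 INTEGER, f3 INTEGER,   -- the 3 unnamed integer columns
        PRIMARY KEY (symbol, day, minute)     -- matches the dominant query
    )
""")
conn.executemany(
    "INSERT INTO bars VALUES (?,?,?,?,?,?,?,?,?,?,?)",
    [
        ("XX",   20041208,  938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0),
        ("AAPL", 20120412, 1215, 600.0, 601.0, 599.5, 600.5, 1_000_000, 0, 0, 0),
        ("AAPL", 20120414,  930, 605.0, 606.0, 604.0, 605.5, 2_000_000, 0, 0, 0),
    ],
)

# "Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"
rows = conn.execute(
    """SELECT day, minute, open, high, low, close
       FROM bars
       WHERE symbol = ?
         AND day * 10000 + minute BETWEEN ? AND ?
       ORDER BY day, minute""",
    ("AAPL", 20120412 * 10000 + 1215, 20120413 * 10000 + 1252),
).fetchall()
print(rows)  # only the 2012-04-12 12:15 bar falls inside the window
```

In a real MySQL deployment one would more likely store a single DATETIME column so the range predicate can use the primary-key index directly; the `day * 10000 + minute` arithmetic here is only to preserve the question's two-column date/minute layout.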

About the hardware: I plan to use Amazon RDS, so I'm flexible on that.

Solution

Tell us about the queries, and your hardware environment.

I would be very very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.

Update

Okay, why?

First of all, notice that I asked about the queries. You can't -- and we certainly can't -- answer these questions without knowing what the workload is like. (I'll coincidentally have an article about this appearing soon, but I can't link it today.) The scale of the problem, however, makes me think about moving away from a Big Old Database, because:

  • My experience with similar systems suggests the access will either be big and sequential (computing some kind of time-series analysis) or very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which will take either lots of time or lots of space.

  • If you're doing what are effectively big runs against lots of data in an OLAP world, however, a column-oriented approach might be best.

  • If you want to do random queries, especially making cross-comparisons, a Hadoop system might be effective. Why? Because

    • you can better exploit parallelism on relatively small commodity hardware.
    • you can also better implement high reliability and redundancy
    • many of those problems lend themselves naturally to the MapReduce paradigm.
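To make that last point concrete, here is a toy illustration in plain Python (no Hadoop) of how a typical aggregation over these rows fits the MapReduce shape: a map step emits a (symbol, day) key per bar, a shuffle groups the emitted values by key, and a reduce step folds each group into, say, an average close. The row layout follows the sample row in the question; everything else is my own sketch.

```python
from collections import defaultdict

# A few bars in the question's row layout:
# (symbol, day, minute, open, high, low, close, volume, i1, i2, i3)
bars = [
    ("AAPL", 20120412, 1215, 600.0, 601.0, 599.0, 600.5, 1e6, 0, 0, 0),
    ("AAPL", 20120412, 1216, 600.5, 602.0, 600.0, 601.5, 9e5, 0, 0, 0),
    ("XX",   20041208,  938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0),
]

# Map: emit ((symbol, day), close) for every bar.
mapped = (((row[0], row[1]), row[6]) for row in bars)

# Shuffle: group values by key (Hadoop does this between map and reduce).
groups = defaultdict(list)
for key, close in mapped:
    groups[key].append(close)

# Reduce: fold each group into a daily average close.
daily_avg = {key: sum(vals) / len(vals) for key, vals in groups.items()}
print(daily_avg)  # {('AAPL', 20120412): 601.0, ('XX', 20041208): 43.7444}
```

On a cluster, the map and reduce steps run in parallel across machines, each holding a shard of the 7.3B rows, which is exactly the "relatively small commodity hardware" point above.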

But the fact is, until we know about your workload, it's impossible to say anything definitive.
