Storing massive ordered time series data in Bigtable derivatives

Question

I am trying to figure out exactly what these newfangled data stores, such as Bigtable, HBase and Cassandra, really are.

I work with massive amounts of stock market data: billions of rows of price/quote data that can add up to hundreds of gigabytes every day (although these text files often compress by at least an order of magnitude). Each row is basically a handful of numbers, two or three short strings and a timestamp (usually at millisecond resolution). If I had to pick a unique identifier for each row, I would have to pick the whole row, since an exchange may generate multiple values for the same symbol in the same millisecond.

I suppose the simplest way to map this data to Bigtable (I'm including its derivatives) is by symbol name and date (which may return a very large time series; more than a million data points isn't unheard of). From reading their descriptions, it looks like multiple keys can be used with these systems. I'm also assuming that decimal numbers are not good candidates for keys.

Some of these systems (Cassandra, for example) claim to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm?

What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?

What if I want to get two time series, subtract one from the other, and return the two time series and their result? Will I have to do this logic in my own program?

Reading the relevant papers seems to show that these systems are not a very good fit for massive time series systems. However, if systems such as Google Maps are based on them, I think time series should work as well. For example, think of time as the x-axis, prices as the y-axis and symbols as named locations, and all of a sudden it looks like Bigtable should be the ideal store for time series (if the whole earth can be stored, retrieved, zoomed and annotated, stock market data should be trivial).

Can some expert point me in the right direction or clear up any misunderstandings?

Thanks

Solution

I am not an expert yet, but I've been playing with Cassandra for a few days now, and I have some answers for you:

  1. Don't worry about the amount of data. It's irrelevant with systems like Cassandra, if you have the $$$ for a large hardware cluster.

> Some of these systems (Cassandra, for example) claim to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm?

Cassandra is very useful when you know how to work with keys. It can sift through keys very quickly. So to search for MSFT between 11:00 am and 1:30 pm, you'd have to key your rows like this:

MSFT-timestamp, GOOG-timestamp, etc.

Then you can tell Cassandra to find all keys in the range that starts at MSFT-now and ends at MSFT-now+1hour.
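As a rough illustration of the key design described above, here is a minimal Python sketch of a range scan over "SYMBOL-timestamp" keys. It does not use any Cassandra API: a sorted list stands in for the store's ordered keys, and the helper names `row_key` and `range_scan`, the timestamp format and the sample data are all assumptions made for this example. The one point it is meant to show is that a fixed-width timestamp makes lexicographic key order match chronological order, so a time window becomes a contiguous slice of keys.

```python
import bisect
from datetime import datetime

def row_key(symbol: str, ts: datetime) -> str:
    """Build a 'SYMBOL-timestamp' key (hypothetical format).

    The fixed-width timestamp keeps string order equal to time order.
    """
    return f"{symbol}-{ts:%Y%m%d%H%M%S%f}"

def range_scan(sorted_keys, start_key, end_key):
    """Return every key k with start_key <= k <= end_key."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_right(sorted_keys, end_key)
    return sorted_keys[lo:hi]

# Made-up sample keys; a sorted list stands in for the ordered key space.
day = (2010, 3, 1)
keys = sorted([
    row_key("MSFT", datetime(*day, 10, 59, 59)),
    row_key("MSFT", datetime(*day, 11, 0, 0)),
    row_key("MSFT", datetime(*day, 12, 15, 30)),
    row_key("MSFT", datetime(*day, 13, 30, 0)),
    row_key("MSFT", datetime(*day, 13, 30, 1)),
    row_key("GOOG", datetime(*day, 12, 0, 0)),
])

# All MSFT keys between 11:00 am and 1:30 pm on that day.
hits = range_scan(
    keys,
    row_key("MSFT", datetime(*day, 11, 0, 0)),
    row_key("MSFT", datetime(*day, 13, 30, 0)),
)
```

Because the question notes that several values can share a millisecond, a real key would probably also need a tie-breaker such as a sequence number; that is left out of this sketch.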

> What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?

I am not an expert, but so far I have realized that Cassandra doesn't search by values at all. So if you want to do the above, you will have to make another table dedicated just to this problem and design your schema to fit the case. But it won't be much different from what I described above. It's all about naming your keys and columns. Cassandra can find them very quickly!
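One way to picture the "another table dedicated just to this problem" idea is an index whose keys begin with the date and the price, so that a price range again becomes a key range. The sketch below is plain Python, not Cassandra code, and everything in it is an assumption for illustration: the key layout `date-cents-symbol`, the helper names, and the made-up symbols and prices. Prices are stored as zero-padded integer cents, which sidesteps the concern in the question about decimal numbers as keys.

```python
import bisect

def index_key(date: str, price: float, symbol: str) -> str:
    """Hypothetical index key: date, zero-padded price in cents, symbol."""
    cents = round(price * 100)
    return f"{date}-{cents:010d}-{symbol}"

def symbols_in_price_range(index, date, low, high):
    """Return the symbols with a price in [low, high] on the given date."""
    start = f"{date}-{round(low * 100):010d}-"
    # '\uffff' sorts after any symbol character used in this example.
    end = f"{date}-{round(high * 100):010d}-\uffff"
    lo = bisect.bisect_left(index, start)
    hi = bisect.bisect_right(index, end)
    # The symbol is everything after the second hyphen.
    return sorted({key.split("-", 2)[2] for key in index[lo:hi]})

# Made-up ticks as (symbol, date, price); the symbols are placeholders.
ticks = [
    ("MSFT", "20100301", 28.46),
    ("AAA", "20100301", 10.10),
    ("BBB", "20100301", 10.25),
    ("CCC", "20100301", 10.26),
    ("AAA", "20100302", 10.05),
]
index = sorted(index_key(d, p, s) for s, d, p in ticks)

matches = symbols_in_price_range(index, "20100301", 10.00, 10.25)
```

The trade-off is that every tick is written twice, once under its symbol key and once under its price key, and the application has to keep the two in step.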

> What if I want to get two time series, subtract one from the other, and return the two time series and their result? Will I have to do this logic in my own program?

Correct, all logic is done inside your program. This is not MySQL. This is just a storage engine. (But I am sure the next versions will offer this sort of thing.)

Please remember that I am a novice at this. If I am wrong, feel free to correct me.
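To make "all logic is done inside your program" concrete, here is a minimal Python sketch of subtracting one series from another on the client after both have been fetched. The function name `subtract_series`, the sample prices and the choice to keep only timestamps present in both series are assumptions for this example; real tick data rarely lines up this neatly and would need an alignment rule of your choosing.

```python
from decimal import Decimal

def subtract_series(a, b):
    """Return a[ts] - b[ts] for every timestamp found in both series.

    Each series is a dict mapping a timestamp string to a Decimal price.
    """
    common = sorted(a.keys() & b.keys())
    return {ts: a[ts] - b[ts] for ts in common}

# Made-up data standing in for two series already read from the store.
series_a = {
    "11:00:00.000": Decimal("28.46"),
    "11:00:01.000": Decimal("28.47"),
    "11:00:02.000": Decimal("28.45"),
}
series_b = {
    "11:00:00.000": Decimal("28.40"),
    "11:00:02.000": Decimal("28.50"),
}

difference = subtract_series(series_a, series_b)
# The caller now holds both inputs and the result, as the question asks.
result = (series_a, series_b, difference)
```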
