Storing massive ordered time series data in Bigtable derivatives


Question

I am trying to figure out exactly what these newfangled data stores such as Bigtable, HBase and Cassandra really are.

I work with massive amounts of stock market data: billions of rows of price/quote data that can add up to hundreds of gigabytes every day (although these text files often compress by at least an order of magnitude). This data is basically a handful of numbers, two or three short strings, and a timestamp (usually at millisecond resolution). If I had to pick a unique identifier for each row, I would have to pick the whole row, since an exchange may generate multiple values for the same symbol in the same millisecond.

I suppose the simplest way to map this data to Bigtable (I'm including its derivatives) is by symbol name and date (which may return a very large time series; more than a million data points isn't unheard of). From reading their descriptions, it looks like multiple keys can be used with these systems. I'm also assuming that decimal numbers are not good candidates for keys.

Some of these systems (Cassandra, for example) claim to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm?

What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?

What if I want to get two time series, subtract one from the other, and return the two time series and their result? Will I have to do this logic in my own program?

Reading the relevant papers seems to show that these systems are not a very good fit for massive time series systems. However, if systems such as Google Maps are based on them, I think time series should work as well. For example, think of time as the x-axis, prices as the y-axis and symbols as named locations: all of a sudden it looks like Bigtable should be the ideal store for time series (if the whole earth can be stored, retrieved, zoomed and annotated, stock market data should be trivial).

Can some expert point me in the right direction or clear up any misunderstandings?

Thanks

Recommended answer

I am not an expert yet, but I've been playing with Cassandra for a few days now, and I have some answers for you:


  1. Don't worry about the amount of data; it is irrelevant with systems like Cassandra, if you have $$




Some of these systems (Cassandra, for example) claim to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm?

Cassandra is very useful when you know how to work with keys. It can sift through keys very quickly. So to search for MSFT between 11:00 am and 1:30 pm, you'd have to key your rows like this:

MSFT-timestamp, GOOG-timestamp, etc.

Then you can tell Cassandra to find all keys that start with MSFT-now and end with MSFT-now+1hour.
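To make the key idea concrete, here is a rough sketch in plain Python. It is not Cassandra code: `SortedKeyStore` is a hypothetical in-memory stand-in for any store that keeps row keys in sorted order, and the `SYMBOL-<zero-padded epoch millis>` key format and the prices are assumptions made up for illustration. The zero padding is what makes string order match time order.

```python
from bisect import bisect_left, bisect_right, insort
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def make_key(symbol: str, ts: datetime) -> str:
    """Build a row key like 'MSFT-0001262602800000' (hypothetical format)."""
    millis = (ts - EPOCH) // timedelta(milliseconds=1)
    # Zero-pad so that lexicographic order equals chronological order.
    return f"{symbol}-{millis:016d}"

class SortedKeyStore:
    """In-memory stand-in for a store that keeps row keys sorted."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = value

    def range(self, start_key, end_key):
        """Return (key, value) pairs with start_key <= key <= end_key."""
        lo = bisect_left(self._keys, start_key)
        hi = bisect_right(self._keys, end_key)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

store = SortedKeyStore()
day = datetime(2010, 1, 4, tzinfo=timezone.utc)
# Made-up prices, for illustration only.
store.put(make_key("MSFT", day.replace(hour=10, minute=59)), 30.60)
store.put(make_key("MSFT", day.replace(hour=11, minute=0)), 30.62)
store.put(make_key("MSFT", day.replace(hour=13, minute=30)), 30.95)
store.put(make_key("MSFT", day.replace(hour=13, minute=31)), 30.97)
store.put(make_key("GOOG", day.replace(hour=12, minute=0)), 626.75)

# All MSFT rows between 11:00 and 13:30, inclusive.
hits = store.range(
    make_key("MSFT", day.replace(hour=11)),
    make_key("MSFT", day.replace(hour=13, minute=30)),
)
prices = [price for _, price in hits]
```

Note that with this key two ticks for the same symbol in the same millisecond would overwrite each other, which the question says can happen; appending something like a sequence number to the key is one possible way around that.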


What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?

I am not an expert, but so far I have realized that Cassandra doesn't search by values at all. So if you want to do the above, you will have to make another table dedicated just to this problem and design your schema to fit the case. But it won't be much different from what I described above. It's all about naming your keys and columns. Cassandra can find them very quickly!
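As a sketch of what such a dedicated table could look like, the snippet below keeps a second sorted set of keys in which the price is part of the key, so a price search becomes another key range scan. This is plain Python, not a Cassandra schema; the `DAY-<price in cents>-SYMBOL` key layout, the helper names and the ticker symbols are all hypothetical. Prices are stored as integer cents because, as the question suspects, decimal numbers make awkward keys.

```python
from bisect import bisect_left, bisect_right

# Sorted list of index keys; stands in for a second, sorted-by-key table.
price_index = []

def index_key(day: str, price_cents: int, symbol: str) -> str:
    """Hypothetical index key, e.g. '20100104-0000001000-AAA'."""
    return f"{day}-{price_cents:010d}-{symbol}"

def record_price(day: str, price_cents: int, symbol: str) -> None:
    """Add one (day, price, symbol) entry to the index, keeping it sorted."""
    key = index_key(day, price_cents, symbol)
    pos = bisect_left(price_index, key)
    if pos == len(price_index) or price_index[pos] != key:
        price_index.insert(pos, key)

def symbols_in_price_range(day: str, low_cents: int, high_cents: int):
    """Return symbols that traded at low <= price <= high on the given day."""
    start = f"{day}-{low_cents:010d}-"
    end = f"{day}-{high_cents:010d}-\uffff"  # sorts after any symbol
    lo = bisect_left(price_index, start)
    hi = bisect_right(price_index, end)
    # The symbol is everything after the second hyphen.
    return sorted({key.split("-", 2)[2] for key in price_index[lo:hi]})

# Made-up tickers and prices, for illustration only.
record_price("20100104", 1000, "AAA")
record_price("20100104", 1025, "BBB")
record_price("20100104", 1026, "CCC")
record_price("20100104", 999, "DDD")
record_price("20100105", 1010, "EEE")

matches = symbols_in_price_range("20100104", 1000, 1025)
```

The cost of this approach is that every tick has to be written twice, once per table, and the application has to keep the two in step.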


What if I want to get two time series, subtract one from the other, and return the two time series and their result? Will I have to do this logic in my own program?

Correct, all logic is done inside your program. This is not MySQL. This is just a storage engine. (But I am sure the next versions will offer this sort of thing.)
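For what it's worth, the client-side logic for the subtraction can be quite small once both series have been fetched. The sketch below assumes each series has already been loaded into a dict mapping a millisecond timestamp to a price, and that only timestamps present in both series are compared; that alignment rule, the function name and the sample values are my own assumptions, and real tick data would probably need a smarter way to line up timestamps.

```python
from decimal import Decimal

def subtract_series(a: dict, b: dict) -> list:
    """Return (timestamp, a_value, b_value, a_value - b_value) tuples
    for every timestamp present in both series, in time order."""
    common = sorted(a.keys() & b.keys())
    return [(ts, a[ts], b[ts], a[ts] - b[ts]) for ts in common]

# Made-up values; Decimal avoids binary floating-point rounding on prices.
series_a = {1000: Decimal("30.62"), 2000: Decimal("30.70"), 3000: Decimal("30.65")}
series_b = {1000: Decimal("30.50"), 3000: Decimal("30.75"), 4000: Decimal("30.80")}

result = subtract_series(series_a, series_b)
```

Each tuple carries both original values alongside the difference, so the caller gets the two series and their result in one pass.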

Please remember that I am a novice at this; if I am wrong, feel free to correct me.
