Design of Partitioning for Azure Table Storage


Question


I have some software which collects data over a long period of time, approx 200 readings per second. It uses an SQL database for this. I am looking to move a lot of my old "archived" data to Azure.

The software uses a multi-tenant type architecture, so I am planning to use one Azure Table per Tenant. Each tenant is perhaps monitoring 10-20 different metrics, so I am planning to use the Metric ID (int) as the Partition Key.

Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
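For reference, here is a minimal sketch of that RowKey scheme outside .NET, in Python (the helper names are my own; .NET ticks are 100-nanosecond intervals counted from 0001-01-01T00:00:00):

```python
from datetime import datetime, timezone

# .NET ticks are 100-nanosecond intervals since 0001-01-01T00:00:00 (UTC assumed here)
EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)

def to_ticks(dt: datetime) -> int:
    """Convert an aware UTC datetime to .NET-style ticks."""
    delta = dt - EPOCH
    return (delta.days * 864_000_000_000      # 24 * 60 * 60 * 10^7 ticks per day
            + delta.seconds * 10_000_000
            + delta.microseconds * 10)

def row_key(dt: datetime) -> str:
    """Equivalent of DateTime.Ticks.ToString("d19"): a 19-digit zero-padded string."""
    return str(to_ticks(dt)).zfill(19)
```

Because the keys are fixed-width and zero-padded, lexicographic order in the table matches chronological order.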

I am lacking a little understanding as to how this will scale, however; so I was hoping somebody might be able to clear this up:

For performance, Azure will/might split my table by PartitionKey in order to keep things nice and quick. This would result in one partition per metric in this case.

However, my rowkey could potentially represent data over approx 5 years, so I estimate approx 2.5 million rows.

Is Azure clever enough to then split based on rowkey as well, or am I designing in a future bottleneck? I know normally not to prematurely optimise, but with something like Azure that doesn't seem as sensible as normal!

Looking for an Azure expert to let me know if I am on the right line or whether I should be partitioning my data into more tables too.

Solution

A few comments:

Apart from storing the data, you may also want to look into how you would want to retrieve the data as that may change your design considerably. Some of the questions you might want to ask yourself:

  • When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
  • Or do I need to retrieve the data for all metrics for a particular date/time range? If this is the case then you're looking at a full table scan. Obviously you could avoid this by doing multiple queries (one query per PartitionKey)
  • Do I need to see the latest results first, or do I not really care? If it's the former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
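The reversed-ticks idea in the last bullet can be sketched like this (Python used for illustration; `MAX_TICKS` is the value of .NET's `DateTime.MaxValue.Ticks`, and the helper names are my own):

```python
from datetime import datetime, timezone

MAX_TICKS = 3155378975999999999  # .NET DateTime.MaxValue.Ticks
EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)

def to_ticks(dt: datetime) -> int:
    """Convert an aware UTC datetime to .NET-style 100-ns ticks."""
    delta = dt - EPOCH
    return (delta.days * 864_000_000_000
            + delta.seconds * 10_000_000
            + delta.microseconds * 10)

def reversed_row_key(dt: datetime) -> str:
    """RowKey that sorts newest-first, per
    (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19")."""
    return str(MAX_TICKS - to_ticks(dt)).zfill(19)
```

A later timestamp produces a lexicographically smaller key, so the newest entities come back first in a partition scan.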

Also, since PartitionKey is a string value, you may want to convert the int value to a string with some "0" padding so that all your IDs appear in order; otherwise you'll get 1, 10, 11, .., 19, 2, ... etc.
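A quick sketch of the padding point, with an arbitrary width of 5 (pick any width larger than your biggest metric ID):

```python
def metric_partition_key(metric_id: int, width: int = 5) -> str:
    """Zero-pad the metric ID so lexicographic order matches numeric order."""
    return str(metric_id).zfill(width)

ids = [1, 2, 10, 11, 19]
unpadded = sorted(str(i) for i in ids)                 # string sort: 1, 10, 11, 19, 2
padded = sorted(metric_partition_key(i) for i in ids)  # 00001, 00002, 00010, 00011, 00019
```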

To the best of my knowledge, Windows Azure partitions the data based on the PartitionKey only, not the RowKey. Within a partition, the RowKey serves as the unique key. Windows Azure will try to keep data with the same PartitionKey on the same node, but since each node is a physical device (and thus has a size limitation), the data may flow to another node as well.

You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.

UPDATE: Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:

Single Table Partition – a table partition is all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is:

  • Up to 2,000 entities per second
  • Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning can process up to 20,000 entities/second, which is the overall account target described above.

Now, you mentioned that you have 10-20 different metric points, and for each metric point you'll write a maximum of 1 record per minute. That means you would be writing a maximum of 20 entities/minute/table, which is well under the scalability target of 2,000 entities/second.

Now the question remains of reading. Assume a user would read a maximum of 24 hours' worth of data (i.e. 24 * 60 = 1440 points) per partition. If the user gets the data for all 20 metrics for 1 day, then each user (and thus each table) will fetch a maximum of 28,800 data points. The question left for you, I guess, is how many requests like this you can expect per second. If you can somehow extrapolate this information, I think you can reach some conclusion about the scalability of your architecture.
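The arithmetic in the two paragraphs above can be checked directly (numbers taken from the question; these are upper bounds, not measurements):

```python
metrics_per_tenant = 20            # upper bound from the question
writes_per_metric_per_minute = 1   # at most one reading per metric per minute

# Write side: entities/second per table, vs. the 2,000/second partition target
write_eps = metrics_per_tenant * writes_per_metric_per_minute / 60  # ~0.33

# Read side: one user pulling 24 hours of data for all metrics
points_per_metric_per_day = 24 * 60                                       # 1440
points_per_user_per_day = points_per_metric_per_day * metrics_per_tenant  # 28800
```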

I would also recommend watching this video: http://channel9.msdn.com/Events/Build/2012/4-004.

Hope this helps.
