How to partition Azure tables used for storing logs


Problem description

We have recently updated our logging to use Azure table storage, which, owing to its low cost and high performance when querying by row and partition, is highly suited to this purpose.

We are trying to follow the guidelines given in the document Designing a Scalable Partitioning Strategy for Azure Table Storage. As we are making a great number of inserts to this table (and hopefully an increasing number as we scale), we need to ensure that we don't hit our limits, resulting in logs being lost. We structured our design as follows:

  • We have an Azure storage account per environment (DEV, TEST, PROD).
  • We have a table per product.
  • We are using TicksReversed+GUID for the Row Key, so that we can query blocks of results between certain times with high performance (a sketch of this key construction follows this list).
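
A minimal sketch of how such a row key could be built, assuming the Azure.Data.Tables SDK and a simple Message property (neither is specified in the question):

```csharp
using System;
using Azure.Data.Tables;

public static class LogKeys
{
    // Builds a RowKey of the form "<reversed ticks>_<guid>" so that newer
    // entries sort first lexicographically within a partition.
    public static TableEntity CreateLogEntity(string partitionKey, string message)
    {
        // Zero-pad to 19 digits so that string ordering matches numeric ordering.
        string reversedTicks = (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("D19");
        string rowKey = $"{reversedTicks}_{Guid.NewGuid():N}";

        return new TableEntity(partitionKey, rowKey) { ["Message"] = message };
    }
}
```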

We originally chose to partition the table by Logger, which for us meant broad areas of the product such as API, Application, Performance and Caching. However, due to the low number of partitions, we were concerned that this resulted in so-called "hot" partitions, where many inserts are performed on one partition in a given time period. So we changed to partitioning on Context (for us, the class name or API resource).

However, in practice we have found this is less than ideal, because when we look at our logs at a glance we would like them to appear in order of time. Instead we end up with blocks of results grouped by context, and we would have to fetch all partitions if we wanted to order them by time.

Some ideas we have had are to:

  • use blocks of time (say 1 hour) for partition keys to order them by time (this results in hot partitions for an hour at a time);
  • use a few random GUIDs for partition keys to try to distribute the logs (we lose the ability to query quickly on features such as Context).

As this is such a common application of Azure table storage, there must be some sort of standard procedure. What is the best practice for partitioning Azure tables that are used for storing logs? To summarise, a solution would ideally:

  • Use cheap Azure storage (Table Storage seems the obvious choice).
  • Support fast, scalable writes.
  • Have a low chance of losing logs (i.e. by exceeding the partition write rate of 2,000 entities per second in Azure table storage).
  • Allow reading ordered by date, most recent first.
  • If possible, partition on something that would be useful to query (such as product area).

Recommended answer

I have come across a situation similar to the one you encountered; based on my experience I can say the following:

Whenever a query is fired against an Azure storage table, it does a full table scan if a proper partition key is not provided. In other words, the storage table is indexed on the Partition Key, and partitioning the data properly is the key to getting fast results.

That said, you will now have to think about what kind of queries you will fire against the table, such as logs that occurred during a certain time period, logs for a particular product, and so on.

One way is to use reverse ticks truncated to hour precision, instead of the exact ticks, as part of the Partition Key. That way an hour's worth of data can be queried based on this partition key. Depending on the number of rows which fall into each partition, you could change the precision to a day. Also, it is wise to store related data together, which means the data for each product would go into a different table. That way you reduce the number of partitions and the number of rows in each partition.
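
A rough sketch of that suggestion, assuming the Azure.Data.Tables SDK; the table name, connection string and Message property are placeholders rather than anything from the answer:

```csharp
using System;
using Azure.Data.Tables;

public static class HourPartitionedLogWriter
{
    // PartitionKey = reversed ticks of the timestamp truncated to the hour, so one
    // partition holds roughly an hour of logs and more recent hours sort first.
    public static string HourPartitionKey(DateTime utc)
    {
        var hour = new DateTime(utc.Year, utc.Month, utc.Day, utc.Hour, 0, 0, DateTimeKind.Utc);
        return (DateTime.MaxValue.Ticks - hour.Ticks).ToString("D19");
    }

    public static void WriteLog(string connectionString, string productTable, string message)
    {
        // One table per product keeps the number of partitions and rows per partition down.
        var table = new TableClient(connectionString, productTable);
        table.CreateIfNotExists();

        var now = DateTime.UtcNow;
        string rowKey = $"{DateTime.MaxValue.Ticks - now.Ticks:D19}_{Guid.NewGuid():N}";
        table.AddEntity(new TableEntity(HourPartitionKey(now), rowKey) { ["Message"] = message });
    }
}
```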

Basically, ensure that you know the partition keys in advance (exact values or a range) and fire queries against those specific partition keys to get results faster.
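
For example, with the hour-precision reversed-tick partition keys sketched above, a time window maps to a small known range of partition keys and can be queried with a filter instead of a table scan (the helper and class names here are illustrative):

```csharp
using System;
using System.Collections.Generic;
using Azure.Data.Tables;

public static class LogReader
{
    static string HourKey(DateTime utc)
    {
        var hour = new DateTime(utc.Year, utc.Month, utc.Day, utc.Hour, 0, 0, DateTimeKind.Utc);
        return (DateTime.MaxValue.Ticks - hour.Ticks).ToString("D19");
    }

    // Reversed keys mean a later time yields a *smaller* key, so the window
    // [fromUtc, toUtc] becomes: HourKey(toUtc) <= PartitionKey <= HourKey(fromUtc).
    public static IEnumerable<TableEntity> QueryWindow(TableClient table, DateTime fromUtc, DateTime toUtc)
    {
        string filter = $"PartitionKey ge '{HourKey(toUtc)}' and PartitionKey le '{HourKey(fromUtc)}'";
        return table.Query<TableEntity>(filter: filter);
    }
}
```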

To speed up writing to the table, you can use batch operations. Be cautious though: if one entity in the batch fails, the whole batch operation fails. Proper retry and error checking can save you here.
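
A minimal sketch of such a batched insert with basic error checking, again assuming the Azure.Data.Tables SDK; note that a single transaction is limited to 100 entities that must all share one partition key, and the retry policy is left out:

```csharp
using System;
using System.Collections.Generic;
using Azure.Data.Tables;

public static class BatchLogWriter
{
    // All entities in one transaction must share the same PartitionKey, and a
    // single transaction is limited to 100 entities.
    public static void WriteBatch(TableClient table, IReadOnlyList<TableEntity> entities)
    {
        var actions = new List<TableTransactionAction>();
        foreach (var entity in entities)
        {
            actions.Add(new TableTransactionAction(TableTransactionActionType.Add, entity));
        }

        try
        {
            table.SubmitTransaction(actions);
        }
        catch (TableTransactionFailedException ex)
        {
            // One failing entity fails the whole batch; the exception reports which one.
            Console.Error.WriteLine($"Batch failed at entity index {ex.FailedTransactionActionIndex}: {ex.Message}");
            // A real implementation would fix or drop the offending entity and retry.
        }
    }
}
```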

At the same time, you could use blob storage to store a lot of related data. The idea is to store a chunk of related serialized data as one blob. You can hit one such blob to get all the data in it and do further projections on the client side. For example, an hour's worth of data for a product would go into one blob; you can devise a specific blob prefix naming pattern and hit the exact blob when needed. This will help you get your data pretty fast, rather than doing a table scan for each query.
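
A sketch of what that blob-per-hour layout could look like with the Azure.Storage.Blobs client; the container name and naming pattern are illustrative assumptions, not part of the answer:

```csharp
using System;
using System.IO;
using Azure.Storage.Blobs;

public static class LogBlobStore
{
    // Stores one hour of one product's logs as a single blob, e.g.
    // "myproduct/2015/06/01/13.log.gz", so the exact blob for a given hour can be
    // addressed directly instead of scanning a table.
    public static void UploadHourChunk(string connectionString, string product, DateTime hourUtc, byte[] payload)
    {
        var container = new BlobContainerClient(connectionString, "logs");
        container.CreateIfNotExists();

        string blobName = $"{product}/{hourUtc:yyyy/MM/dd/HH}.log.gz";
        using var stream = new MemoryStream(payload);
        container.GetBlobClient(blobName).Upload(stream, overwrite: true);
    }
}
```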

I used the blob approach and have been using it for a couple of years with no trouble. I convert my collection to IList<IDictionary<string,string>> and use binary serialization and Gzip to store each blob. I use Reflection.Emit-based helper methods to access entity properties very quickly, so serialization and deserialization don't take a toll on CPU and memory.
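
The answer does not show its serialization code; as an approximation, the same idea can be sketched with JSON plus GZipStream (the original used binary serialization and Reflection.Emit helpers instead):

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text.Json;

public static class LogChunkSerializer
{
    // Serializes a chunk of log rows and gzips the result so it can be stored as one blob.
    // The original answer used binary serialization; JSON is used here for a self-contained sketch.
    public static byte[] Compress(IList<IDictionary<string, string>> rows)
    {
        byte[] json = JsonSerializer.SerializeToUtf8Bytes(rows);
        using var output = new MemoryStream();
        using (var gzip = new GZipStream(output, CompressionLevel.Optimal, leaveOpen: true))
        {
            gzip.Write(json, 0, json.Length);
        }
        return output.ToArray();
    }

    public static IList<IDictionary<string, string>> Decompress(byte[] compressed)
    {
        using var input = new MemoryStream(compressed);
        using var gzip = new GZipStream(input, CompressionMode.Decompress);
        return JsonSerializer.Deserialize<IList<IDictionary<string, string>>>(gzip);
    }
}
```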

Storing data in blobs helps me store more for less and get my data back faster.
