Why is Cosmos DB creating 5 partitions for the same partition key value?


Question

We are using Cosmos DB SQL API, and here's a collection XYZ with:

Size: Unlimited
Throughput: 50,000 RU/s
Partition Key: Hashed

We are inserting 200,000 records, each ~2.1 KB in size, all with the same value for the partition key column. To our knowledge, all documents with the same partition key value are stored in the same logical partition, and a logical partition should not exceed the 10 GB limit, whether the collection is fixed-size or unlimited.
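A quick back-of-the-envelope check of the numbers above (200,000 documents at ~2.1 KB each) confirms the data is nowhere near the logical-partition limit:

```python
# Rough size check, using the figures from the question
doc_count = 200_000
doc_size_kb = 2.1
total_gb = doc_count * doc_size_kb / 1024 / 1024
print(round(total_gb, 2))  # 0.4 -- far below the 10 GB logical-partition cap
```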

Clearly our total data is not even 0.5 GB. However, the Metrics blade of Azure Cosmos DB (in the portal) says:

Collection XYZ has 5 partition key ranges. Provisioned throughput is evenly distributed across these partitions (10000 RU/s per partition).

This does not match what we have studied so far in the MSFT docs. Are we missing something? Why were these 5 partitions created?

Answer

When using the Unlimited collection size, you will be provisioned 5 physical partition key ranges by default. This number can change, but as of May 2018 the default is 5. You can think of each physical partition as a "server", so your data will be spread among 5 physical "servers". As your data size grows, it will automatically be redistributed across more physical partitions. That's why getting the partition key right upfront in your design is so important.
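Cosmos DB's actual partitioning hash is internal to the service, but the idea can be sketched with a toy hash (MD5 here is purely a stand-in, an assumption for illustration): every document sharing a partition key value maps to the same single range.

```python
import hashlib

N_RANGES = 5  # default number of physical partition key ranges (as of May 2018)

def toy_range(pk_value: str) -> int:
    """Toy stand-in for Cosmos DB's internal partitioning hash
    (the real hash is not public): map a PK value to one range."""
    digest = hashlib.md5(pk_value.encode()).digest()
    return int.from_bytes(digest[:8], "little") % N_RANGES

# A given partition key value always hashes to the same single range,
# so every document sharing that value lands on one physical partition.
ranges_hit = {toy_range("tenant-42") for _ in range(1_000)}
print(len(ranges_hit))  # 1
```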

The problem in your scenario, where all 200k records share the same partition key (PK), is that you will have hot spots. You have 5 physical "servers", but only one will ever be used. The other 4 will sit idle, and the result is that you get less performance for the same price point: you're paying for 50k RU/s but will only ever be able to use 10k RU/s.

Change your PK to something more uniformly distributed. The right choice depends, of course, on how you read the data; if you give more detail about the docs you're storing, we may be able to make a recommendation. If you're simply doing point lookups (calling ReadDocumentAsync() by each document ID), then you can safely partition on the ID field of the document. This will spread all 200k of your docs across all 5 physical partitions, and your 50k RU/s throughput will be fully utilized. Once you do this, you will probably find that you can reduce the RU provisioning to something much lower and save a ton of money. With only 200k records, each 2.1 KB, you could probably go as low as 2,500 RU/s (1/20th of what you're paying now).
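The contrast can be sketched with the same kind of toy hash (again, MD5 standing in for the service's internal hash): partitioning on a per-document ID spreads the 200k documents roughly evenly over all five ranges, so none of the provisioned throughput is stranded on idle partitions.

```python
import hashlib
from collections import Counter

N_RANGES = 5  # partition key ranges reported in the portal

def toy_range(value: str) -> int:
    # Toy stand-in for the internal partitioning hash (illustration only)
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[:8], "little") % N_RANGES

# Partitioning on a per-document ID spreads 200k docs across all five
# ranges, so all 50k RU/s of provisioned throughput becomes usable.
counts = Counter(toy_range(f"doc-{i}") for i in range(200_000))
print(sorted(counts.values()))  # five roughly equal buckets of ~40k each
```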

* "Server" is in quotes because each physical partition is actually a collection of many servers, load-balanced for high availability and throughput (depending on your consistency level).
