Would using a substring of a GUID in CosmosDB as partitionkey be a bad idea?

Question

I'm doing some R&D to move a product catalog into CosmosDB.

In its simplest terms, a Product document will have:

  • Product ID (GUID)
  • Product name
  • Manufacturer

A manufacturer will log into this system and will only be able to query their own data so there will always be a ManufacturerId = SINGLE_VALUE filter on every query.

When reviewing the Cosmos docs on choosing the correct partition strategy, there seem to be 2 main points:

  • Choose a partition key with a high cardinality
  • Choose a partition key that gives an even distribution of data

In my scenario above, choosing product Id as the PartitionKey would be pretty extreme: 1 document per logical partition. On the other hand, choosing Manufacturer wouldn't be great either, since that won't result in an even distribution (some manufacturers have 10 products, others have 100,000).

One way to ensure an even distribution would be to take the first 4 characters of the GUID and use that as a PartitionKey (so at most 65,536 partition key values, since 4 hex characters allow 16^4 combinations; 3 characters would give 4,096). Based on the existing dataset I have, this does result in an even distribution of data, but I'm wondering whether there are any downsides to doing this.
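To make the prefix idea concrete, here is a minimal sketch that derives a partition key from the leading hex characters of a GUID and counts how many documents land on each key. The function name `prefix_partition_key` and the sample size are assumptions made for illustration, and the GUIDs are random version 4 values rather than real product ids.

```python
import uuid
from collections import Counter

def prefix_partition_key(product_id: str, length: int = 4) -> str:
    """Return the first `length` hex characters of a GUID (hypothetical helper)."""
    return product_id.replace("-", "").lower()[:length]

# Generate sample GUIDs and count documents per prefix.
product_ids = [str(uuid.uuid4()) for _ in range(20_000)]
counts = Counter(prefix_partition_key(pid) for pid in product_ids)

print("distinct keys:", len(counts), "of", 16 ** 4, "possible")
print("largest bucket:", max(counts.values()))
print("smallest bucket:", min(counts.values()))
```

With random GUIDs the buckets come out roughly even, which matches the observation above, but the number of distinct keys can never exceed 16 ** length.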

Are there any downsides to just using the entire productId as the PartitionKey (1 doc per partition)? The docs seem to indicate that's a valid approach for a system that stores user profiles. Would this approach have implications for searching for multiple products in the same search?

Recommended answer

Using a key that is unique per-document is a good way to ensure even distribution to support high performance - so that makes the full product id a great choice. I don't believe you would gain any advantage from using a substring of a full guid as a partition key - and you would be limiting your maximum number of usable partitions.

So why not always use a unique identifier as the partition key?

First, if you add a partition key to a query, you do not need to enable cross-partition query and you will have a lower overall query cost (in RUs). So if you can design your partition key to reduce your need for cross-partition queries, it could save RUs. I don't think a 'substring of a guid' helps you there, because the random nature of the guid would not distribute documents in a way you could take advantage of for efficient querying.
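The routing difference can be illustrated with a toy in-memory model. This is only a sketch: the partition count, the SHA-256 based `physical_partition` function, and the field names `productid` and `manufacturerid` are assumptions, and the real service uses its own hashing and partition management.

```python
import hashlib

PHYSICAL_PARTITIONS = 8  # hypothetical count, for illustration only

def physical_partition(partition_key: str) -> int:
    """Toy stand-in for the service's internal hash of the partition key."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % PHYSICAL_PARTITIONS

def partitions_touched(docs, key_field, filter_field, filter_value):
    """Return the set of partitions a query with one equality filter must visit."""
    if filter_field == key_field:
        # The filter supplies the partition key: route to a single partition.
        return {physical_partition(filter_value)}
    # Otherwise the query must fan out to every partition that holds data.
    return {physical_partition(d[key_field]) for d in docs}

docs = [
    {"productid": f"product-{i}", "manufacturerid": f"m{i % 3}"}
    for i in range(1000)
]

by_manufacturer = partitions_touched(docs, "manufacturerid", "manufacturerid", "m1")
by_product_key = partitions_touched(docs, "productid", "manufacturerid", "m1")
print(len(by_manufacturer), "partition(s) when partitioned by manufacturerid")
print(len(by_product_key), "partition(s) when partitioned by productid")
```

In this model a manufacturer filter is served by one partition only when the manufacturer id is the partition key. With a product id (or a GUID prefix) as the key, the same filter fans out across partitions.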

Second, only documents with the same partition key are guaranteed to all be available on the same partition if you need to involve them in a transactional stored procedure. A 'substring of a guid' also doesn't help with this case.

I almost always use 'identifier' based partition keys such as your product id. This doesn't always correspond to the 'id' of the document itself. Sometimes I have multiple documents with content related to the same thing. For example, if I have some product information synced from another system, that sync job can be most efficient if it uses upsert, but due to the lack of partial update support in CosmosDB at the time of writing (see user voice), the whole document needs to be upserted. So in this case I have one document for the synced information, and a separate document for other information. This could look something like:

{
  "id": "12345:myinfo",
  "productid": "12345",
  "info": {},
  "type": "myinfotype"
},
{
  "id": "12345:vendorsync",
  "productid": "12345",
  "syncedinfo": {},
  "type": "vendorsync"
}

Here the product id is the partition key, and I have a couple of different documents related to that product that I know will reside on the same partition so I can query them efficiently or involve them in a transaction.
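As a rough sketch of this pattern, the helper below builds documents whose "id" is the product id plus a suffix while "productid" carries the partition key, then groups the ids by partition key. The helper name `make_doc` and the second product id are assumptions added for illustration.

```python
from collections import defaultdict

def make_doc(product_id, suffix, doc_type, **fields):
    """Build a document with id '<productid>:<suffix>' (hypothetical helper)."""
    return {
        "id": f"{product_id}:{suffix}",
        "productid": product_id,  # the partition key value
        "type": doc_type,
        **fields,
    }

docs = [
    make_doc("12345", "myinfo", "myinfotype", info={}),
    make_doc("12345", "vendorsync", "vendorsync", syncedinfo={}),
    make_doc("67890", "myinfo", "myinfotype", info={}),
]

# Group document ids by partition key to see which documents are co-located.
by_partition_key = defaultdict(list)
for doc in docs:
    by_partition_key[doc["productid"]].append(doc["id"])

print(dict(by_partition_key))
```

Both documents for product 12345 share one partition key value, so they can be read with a single-partition query, while each document still has a unique "id".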

I have also used this pattern when implementing a revision system, so that all revisions of the same logical document are guaranteed to be placed on the same partition. In that case the document has a "documentid" that is the same for all revisions, and the actual "id" of the document is the document id with the revision number added.
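A revision scheme along these lines might be sketched as follows. The ":" separator, the explicit "revision" field, and the helper names are assumptions, since the answer only states that the id is the document id with the revision number added.

```python
def revision_doc(document_id: str, revision: int, content: dict) -> dict:
    """Build one revision; 'documentid' is the partition key shared by all revisions."""
    return {
        "id": f"{document_id}:{revision}",  # separator is an assumption
        "documentid": document_id,
        "revision": revision,               # hypothetical convenience field
        "content": content,
    }

def latest_revision(docs, document_id: str) -> dict:
    """Pick the highest revision among documents that share one partition key."""
    matching = [d for d in docs if d["documentid"] == document_id]
    return max(matching, key=lambda d: d["revision"])

revisions = [
    revision_doc("doc-1", 1, {"title": "draft"}),
    revision_doc("doc-1", 2, {"title": "review"}),
    revision_doc("doc-1", 3, {"title": "final"}),
    revision_doc("doc-2", 1, {"title": "other"}),
]

print(latest_revision(revisions, "doc-1")["id"])
```

Because every revision of a logical document carries the same "documentid", finding the latest one only needs to look at documents under a single partition key.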

Please also review 'Design for Partitioning' here if you haven't already: https://docs.microsoft.com/en-us/azure/cosmos-db/partition-data
