Indexing arrays in CosmosDB

Question

Why doesn't CosmosDB index arrays by default? The default index path is

"path": "/*"

Doesn't that mean "index everything"? Not "index everything except arrays".

If I add my array field to the index with something like this:

"path": "/tags/[]/?"

It will work and start indexing that particular array field.

But my question is why doesn't "index everything" index everything?

Here's a blog post that describes the behavior I'm seeing: http://www.devwithadam.com/2017/08/querying-for-items-in-array-in-cosmosdb.html ARRAY_CONTAINS queries are very slow and clearly do not use the index. If you explicitly add the field in question to the index, the queries become fast (clearly they start using the index).

Recommended answer

"New" index layout

As stated in Index Types:

Azure Cosmos containers support a new index layout that no longer uses the Hash index kind. If you specify a Hash index kind on the indexing policy, the CRUD requests on the container will silently ignore the index kind and the response from the container only contains the Range index kind. All new Cosmos containers use the new index layout by default.

The issue below does not apply to the new index layout. There, the default indexing policy works fine (and delivers the results in 36.55 RUs). However, pre-existing collections may still be using the old layout.

I was able to reproduce the issue with ARRAY_CONTAINS that you are asking about.

I set up a CosmosDB collection with 100,000 posts from the SO data dump (for example, this question would be represented as below)

{
    "id": "50614926",
    "title": "Indexing arrays in CosmosDB",
    /* Other irrelevant properties omitted */
    "tags": [
        "azure",
        "azure-cosmosdb"
    ]
}

I then executed the following query

SELECT COUNT(1)
FROM t IN c.tags
WHERE t = 'sql-server'

The query took over 2,000 RUs with the default indexing policy and 93 RUs with the following addition (as shown in your linked article)

{
    "path": "/tags/[]/?",
    "indexes": [
        {
            "kind": "Hash",
            "dataType": "String",
            "precision": -1
        }
    ]
}

However, what you are seeing here is not that the array values aren't being indexed by default. It is just that the default range index is not useful for your query.

The range index uses keys based on partial forward paths, so it will contain paths such as the following.

  • tags/0/azure
  • tags/0/c#
  • tags/0/oracle
  • tags/0/sql-server
  • tags/1/azure-cosmosdb
  • tags/1/c#
  • tags/1/sql-server

With this index structure it starts at tags/0/sql-server and then reads all of the remaining tags/0/ entries and the entirety of the entries for tags/n/ where n is an integer greater than 0. Each distinct document mapping to any of these needs to be retrieved and evaluated.
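
The scan described above can be sketched with a toy model. This is only an illustration under stated assumptions: the sorted list of `(path, doc_id)` pairs, the four sample documents, and the helper `scan_from` are all hypothetical stand-ins, not Cosmos DB's actual storage format or API.

```python
import bisect

# Hypothetical sample documents (doc_id -> tags); not taken from the real data set.
docs = {
    "d1": ["azure", "azure-cosmosdb"],
    "d2": ["c#", "sql-server"],
    "d3": ["sql-server", "c#"],
    "d4": ["oracle"],
}

# Toy forward-path index: entries sorted by "tags/<position>/<value>".
forward_index = sorted(
    (f"tags/{pos}/{tag}", doc_id)
    for doc_id, tags in docs.items()
    for pos, tag in enumerate(tags)
)

def scan_from(index, start_key):
    """Return every entry at or after start_key, mimicking the scan described above."""
    start = bisect.bisect_left(index, (start_key,))
    return index[start:]

scanned = scan_from(forward_index, "tags/0/sql-server")
candidate_docs = {doc_id for _, doc_id in scanned}
# Each candidate still has to be fetched and checked against the predicate.
matching_docs = {d for d in candidate_docs if "sql-server" in docs[d]}
```

In this toy data the scan touches four of the seven index entries and three of the four documents in order to find two matches, which hints at why the cost grows with the size of the collection rather than with the number of matches.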

By contrast the hash index uses reverse paths (more details - PDF)

The Stack Overflow UI theoretically allows a maximum of 5 tags per question, so in this case (ignoring the fact that a few questions have more tags through site admin activities) the reverse paths of interest are

  • sql-server/0/tags
  • sql-server/1/tags
  • sql-server/2/tags
  • sql-server/3/tags
  • sql-server/4/tags

With the reverse path structure, finding all paths with leaf nodes of value sql-server is straightforward.
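
As a rough sketch of why the reverse layout helps, the toy model below keys a dictionary by `<value>/<position>/tags`. The sample documents, the `reverse_index` dictionary, and the `lookup` helper are assumptions made for illustration; they do not reflect how Cosmos DB stores its hash index internally.

```python
from collections import defaultdict

# Hypothetical sample documents (doc_id -> tags).
docs = {
    "d1": ["azure", "azure-cosmosdb"],
    "d2": ["c#", "sql-server"],
    "d3": ["sql-server", "c#"],
    "d4": ["oracle"],
}

# Toy reverse-path index: "<value>/<position>/tags" -> set of doc ids.
reverse_index = defaultdict(set)
for doc_id, tags in docs.items():
    for pos, tag in enumerate(tags):
        reverse_index[f"{tag}/{pos}/tags"].add(doc_id)

def lookup(tag, max_positions=5):
    """Probe one key per array position; only matching documents are returned."""
    hits = set()
    for pos in range(max_positions):
        hits |= reverse_index.get(f"{tag}/{pos}/tags", set())
    return hits

sql_server_docs = lookup("sql-server")
```

Here each probe is a direct key lookup, so no unrelated entries (such as those for `azure` or `c#`) are read along the way.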

In this specific use case, because the arrays are bounded to a maximum of 5 possible values, it is also possible to use the original range index efficiently by looking at just those specific paths.

The following query took 97 RUs with the default indexing policy in my test collection.

SELECT COUNT(1)
FROM c
WHERE  'sql-server' IN (c.tags[0], c.tags[1], c.tags[2], c.tags[3], c.tags[4])
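
The effect of naming each array position explicitly can be mimicked on a toy forward-path index: instead of one open-ended scan, the query becomes five exact seeks. As before, the sample documents and the `seek_exact` helper are hypothetical and only approximate the idea; they are not how Cosmos DB executes the query.

```python
import bisect

# Hypothetical sample documents (doc_id -> tags).
docs = {
    "d1": ["azure", "azure-cosmosdb"],
    "d2": ["c#", "sql-server"],
    "d3": ["sql-server", "c#"],
    "d4": ["oracle"],
}

# Toy forward-path index: entries sorted by "tags/<position>/<value>".
forward_index = sorted(
    (f"tags/{pos}/{tag}", doc_id)
    for doc_id, tags in docs.items()
    for pos, tag in enumerate(tags)
)

def seek_exact(index, key):
    """Return only the entries whose path equals key."""
    i = bisect.bisect_left(index, (key,))
    hits = []
    while i < len(index) and index[i][0] == key:
        hits.append(index[i])
        i += 1
    return hits

# One exact seek per possible position, mirroring c.tags[0] .. c.tags[4].
touched = [
    entry
    for pos in range(5)
    for entry in seek_exact(forward_index, f"tags/{pos}/sql-server")
]
bounded_matches = {doc_id for _, doc_id in touched}
```

In this sketch only the two entries that actually match are touched, which is consistent with the large drop in RUs reported above for the bounded query.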
