Azure Cosmos DB aggregation and indexes


Question

I'm trying to use Cosmos DB and I'm having some trouble making a simple count in a collection.

My collection schema is below and I have 80,000 documents in this collection.

{
    "_id" : ObjectId("5aca8ea670ed86102488d39d"),
    "UserID" : "5ac161d742092040783a4ee1",
    "ReferenceID" : 87396,
    "ReferenceDate" : ISODate("2018-04-08T21:50:30.167Z"),
    "ElapsedTime" : 1694,
    "CreatedDate" : ISODate("2018-04-08T21:50:30.168Z")
}

If I run the command below to count all documents in the collection, I get the result quickly:

db.Tests.count()

But when I run the same command for a specific user, I get a "Request rate is large" message:

db.Tests.find({UserID:"5ac161d742092040783a4ee1"}).count()

In the Cosmos DB documentation I found this scenario, and the suggestion is to increase RUs. Currently I have 400 RU/s; when I increase to 10,000 RU/s I can run the command without errors, but it takes 5 seconds.

I already tried creating an index explicitly, but it seems Cosmos DB doesn't use the index for the count.

I do not think it is reasonable to have to pay for 10,000 RU/s just to run a simple count on a collection of approximately 100,000 documents, especially when it still takes about 5 seconds.

Answer

Count-by-filter queries DO use indexes if they are available.

If you try a count-by-filter on a non-indexed column, the query will not time out but fail. Try it. You should get an error along the lines of:

{"Errors":["An invalid query has been specified with filters against path(s) excluded from indexing. Consider adding allow scan header in the request."]}

So definitely add a suitable index on UserID.
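
Through the MongoDB API this is a one-line shell command. A sketch, assuming the collection is named Tests as in the question (run against a live Cosmos DB MongoDB API endpoint):

```javascript
// Mongo shell, connected to the Cosmos DB MongoDB API endpoint.
// Creates an ascending single-field index on UserID so that
// filtered counts on UserID can be served from the index.
db.Tests.createIndex({ UserID: 1 })
```

After the index is in place, the filtered count from the question should stop failing with "Request rate is large" at a reasonable RU budget.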

If you don't have index coverage and don't get the above error, then you have probably set the enableScanInQuery flag. This is almost always a bad idea, and a full scan will not scale: it will consume increasingly large amounts of RU as your dataset grows. So make sure it is off, and index instead.

When you DO have an index on the selected column, your query should run. You can verify that the index is actually being used by sending the x-ms-documentdb-populatequerymetrics header, which should return a confirmation with indexLookupTimeInMs and indexUtilizationRatio fields. Example output:

"totalExecutionTimeInMs=8.44;queryCompileTimeInMs=8.01;queryLogicalPlanBuildTimeInMs=0.04;queryPhysicalPlanBuildTimeInMs=0.06;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.14;indexLookupTimeInMs=0.11;documentLoadTimeInMs=0.00;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=0;retrievedDocumentSize=0;outputDocumentCount=1;outputDocumentSize=0;writeOutputTimeInMs=0.01;indexUtilizationRatio=0.00"

It also gives you some insight into where the effort has gone if you feel the RU charge is too large.
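
If you want to inspect these metrics programmatically, the semicolon-delimited string can be split into key/value pairs. A minimal Node.js sketch; the sample string is the output shown above, and parseQueryMetrics is a hypothetical helper written here for illustration, not an SDK function:

```javascript
// Split the metrics string (returned when the
// x-ms-documentdb-populatequerymetrics header is set on the request)
// into an object of numeric values.
function parseQueryMetrics(raw) {
  const metrics = {};
  for (const pair of raw.split(";")) {
    const [key, value] = pair.split("=");
    if (key) metrics[key.trim()] = Number(value);
  }
  return metrics;
}

// Sample metrics string from the answer above.
const sample =
  "totalExecutionTimeInMs=8.44;queryCompileTimeInMs=8.01;" +
  "queryLogicalPlanBuildTimeInMs=0.04;queryPhysicalPlanBuildTimeInMs=0.06;" +
  "queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.14;" +
  "indexLookupTimeInMs=0.11;documentLoadTimeInMs=0.00;" +
  "systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;" +
  "retrievedDocumentCount=0;retrievedDocumentSize=0;outputDocumentCount=1;" +
  "outputDocumentSize=0;writeOutputTimeInMs=0.01;indexUtilizationRatio=0.00";

const metrics = parseQueryMetrics(sample);
console.log(metrics.indexLookupTimeInMs);   // 0.11
console.log(metrics.indexUtilizationRatio); // 0
```

A low retrievedDocumentCount together with a non-zero indexLookupTimeInMs is the pattern you want to see: the count was answered from the index rather than by loading documents.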

If the index lookup time itself is too high, consider whether your index is selective enough and whether the index settings are suitable. Look at your UserID values and their distribution and adjust the index accordingly.

Another wild guess to consider: check whether the API you are using defers executing find(..) until it knows that count() is really what you are after. It is unclear which API you are using. If it turns out it fetches all matching documents to the client side before counting, that would explain an unexpectedly high RU cost, especially if there are many matching documents or large documents involved. Check the API documentation.
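
With the MongoDB API specifically, one way to rule this out is to use a count that is always evaluated server-side. A sketch, hedged on shell and driver version (countDocuments is available in recent mongo shells and drivers; support on the Cosmos DB MongoDB API depends on the account's wire-protocol version):

```javascript
// cursor.count() behavior varies between drivers; countDocuments()
// always sends a server-side filtered count instead of fetching documents.
db.Tests.countDocuments({ UserID: "5ac161d742092040783a4ee1" })
```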

I also suggest executing the same query directly in the Azure Portal to compare the RU cost and verify whether the issue is client-related.
