Using full text search with geospatial index on Mongodb


Question


Let's say I want to develop an Android app that allows a user to search for the hotel closest to where they are located. This is very common in apps nowadays, Airbnb for example.

This is the dataset I'm using:

{
    "name" : "The Most Amazing Hotel",
    "city" : "India",
    "type": "Point",
    "coord": [
        -56.16082,
        61.15392
      ]
}

{
    "name" : "The Most Incredible Hotel",
    "city" : "India",
    "type": "Point",
    "coord": [
        -56.56285,
        61.34590
      ]
}

{
    "name" : "The Fantastic GuestHouse",
    "city" : "India",
    "type": "Point",
    "coord": [
        -56.47085,
        61.11357
      ]
}

Now, I want to create a text index on the name field so that it searches by name and then sorts by a geospatial index based on the coordinates.

So if I search for the words "The Most", it will search the name field for the words "The Most" and return the closest hotels with the words "The Most" in them.

Does MongoDB even support this type of search?

I'm reading the guidance for MongoDB here: https://docs.mongodb.org/manual/core/index-text/

A compound text index cannot include any other special index types, such as multi-key or geospatial index fields.

As far as I understand, I'm not creating a compound text index. This is a simple text index which means I'm only indexing the text for the name field and not for the city AND name fields.

Solution

There is a fair case that you really do not need this at all, as it is very hard to justify a use case for such an operation, and I would argue that "Searching for a Hotel" is not something where a combination of "text" and "geoSpatial" search really applies.

In reality "most people" would be looking for something close to a location, or even more likely close to various locations they want to visit, as part of their primary criteria, and then other "winners" would likely be weighted more heavily towards "cost", "rating", "brand", "facilities", and likely even proximity to eateries etc.

Adding "Text search" to that list is a very different thing and likely not of much real use in this particular application.

Still, this probably deserves some explanation, and there are a few concepts to understand here as to why the two concepts don't really "mesh" for this use case at least.

Fixing Schema

Firstly, I'd like to make a suggestion to "tweak" your data schema a little:

{
    "name" : "The Most Amazing Hotel",
    "city" : "India",
    "location": {
        "type": "Point",
        "coordinates": [
               72.867804,
               19.076033
        ]
    }
}

That at least provides "location" as a valid GeoJSON object for indexing, and you generally want GeoJSON rather than legacy coordinate pairs, as it opens up more options for query and storage in general, plus distances are standardized to meters rather than the equated "radians" around the globe.
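As a rough sketch of that reshaping, the helper below converts a document laid out as in the question into one with a GeoJSON "location" field. The name toGeoJsonHotel is hypothetical, and it assumes the existing "coord" array is already stored as [longitude, latitude], which is the order GeoJSON expects.

```javascript
// Hypothetical helper: reshape a document from the question's layout
// ({ type, coord }) into one with a GeoJSON "location" field.
// Assumption: coord is stored as [longitude, latitude].
function toGeoJsonHotel(doc) {
  const [longitude, latitude] = doc.coord;
  if (longitude < -180 || longitude > 180 || latitude < -90 || latitude > 90) {
    throw new RangeError("coordinates out of range: " + JSON.stringify(doc.coord));
  }
  // Copy every field except the legacy "type" and "coord" ones.
  const { type, coord, ...rest } = doc;
  return {
    ...rest,
    location: { type: "Point", coordinates: [longitude, latitude] }
  };
}

// One of the documents from the question, unchanged.
const legacyHotel = {
  name: "The Most Amazing Hotel",
  city: "India",
  type: "Point",
  coord: [-56.16082, 61.15392]
};
console.log(JSON.stringify(toGeoJsonHotel(legacyHotel), null, 2));
```

The range check is only there to catch pairs stored the other way round where that is detectable; it cannot detect every swapped pair.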

Why they don't work together

So your reading is basically correct in that you cannot use more than one special index at once. First look at the compound index definition:

db.hotels.createIndex({ "name": "text", "location": "2dsphere" })

{ "ok" : 0, "errmsg" : "bad index key pattern { name: \"text\", location: \"2dsphere\" }: Can't use more than one index plugin for a single index.", "code" : 67 }

So that cannot be done. Even considering the indexes separately:

db.hotels.createIndex({ "name": "text" })
db.hotels.createIndex({ "location": "2dsphere" })

Then try doing a query:

db.hotels.find({
    "location": {
        "$nearSphere": {
            "$geometry": {
                "type": "Point",
                "coordinates": [
                   72.867804,
                   19.076033
                ]
            }
        }
    },
    "$text": { "$search": "Amazing" }
})

Error: command failed: { "waitedMS" : NumberLong(0), "ok" : 0, "errmsg" : "text and geoNear not allowed in same query", "code" : 2 } : undefined

Which actually backs up the reasons why this could not be defined in a compound index in three ways:

  1. As the initial error indicates, the way these "special" indexes are handled in MongoDB requires essentially "branching off" to the "special" handler for the selected index type, and the two handlers do not live in the same place.

  2. Even with separate indexes, since the logic is basically an "and" condition, MongoDB cannot actually select more than one index anyway, and since both query clauses require "special" handling it would in fact be required to do so. And it cannot.

  3. Even if this were logically an $or condition, you basically end back at point 1, where even applying "index intersection" there is another property of such "special" indexes that they must be applied at the "top level" of the query operations in order to allow index selection. Wrapping these in an $or means MongoDB cannot do that and therefore it is not allowed.

But you can "Cheat"

So each basically has to be exclusive, and you cannot use them together. But of course you can always "cheat", depending on which order of search is more important to you.

Either by "location" first:

db.hotels.aggregate([
    { "$geoNear": {
        "near": {
            "type": "Point",
            "coordinates": [
               72.867804,
               19.076033
            ]
        },
        "spherical": true,
        "maxDistance": 5000,
        "distanceField": "distance",
        "query": {
           "name": /Amazing/
        }
    }}
])

Or even:

db.hotels.find({
    "location": {
        "$nearSphere": {
            "$geometry": {
                "type": "Point",
                "coordinates": [
                   72.867804,
                   19.076033
                ]
            },
            "$maxDistance": 5000
        }
    },
    "name": /Amazing/
})

Or by text search first:

db.hotels.find({
    "$text": { "$search": "Amazing" },
    "location": {
        "$geoWithin": {
            "$centerSphere": [[
               72.867804,
               19.076033
            ], 5000 / 6378100 ]  // radius in radians: meters / Earth's equatorial radius in meters
        }
    }
})

Now you can take a close look at the selection options in each approach with .explain() to see what is happening, but the basic case is that each selects only one of the special indexes to use respectively.

In the first case it will be the geoSpatial index on the collection that is used for the primary selection. It finds results based on their proximity to the given location first and then filters them by the regular expression given for the name field.

In the second case it will use the "text" index to do the primary selection ( therefore finding things "Amazing" first ) and from those results apply a geoSpatial filter ( not using an index ) with $geoWithin, which in this case is performing what is basically the equivalent of what a $near is doing, by searching within a circle around a point within the supplied distance to filter results there.
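One detail worth spelling out is that $centerSphere takes its radius in radians rather than meters, so a distance such as 5000 meters has to be divided by the Earth's radius first. A minimal sketch, where metersToRadians and withinCircle are hypothetical helper names and 6378100 meters is the approximate equatorial radius used in the MongoDB documentation:

```javascript
// Approximate equatorial radius of the Earth in meters (about 6378.1 km).
const EARTH_RADIUS_METERS = 6378100;

// Convert a distance in meters to the radians that $centerSphere expects.
function metersToRadians(meters) {
  return meters / EARTH_RADIUS_METERS;
}

// Hypothetical helper: build the $geoWithin clause for a circle
// of the given radius (in meters) around a longitude/latitude point.
function withinCircle(longitude, latitude, meters) {
  return {
    $geoWithin: {
      $centerSphere: [[longitude, latitude], metersToRadians(meters)]
    }
  };
}

// In the shell this could be used as:
//   db.hotels.find({ "$text": { "$search": "Amazing" },
//                    "location": withinCircle(72.867804, 19.076033, 5000) })
console.log(JSON.stringify(withinCircle(72.867804, 19.076033, 5000)));
```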

Not "all" Queries are Equal

The key thing to consider though is that it is very possible for each approach to return different results. By narrowing down on location first, the only data that can be inspected are those locations within the specified distance, so anything that is "Amazing" outside of the distance would never be considered by the additional filter.

In the second case, since the text term is the primary search, then all results of "Amazing" are put into consideration, and the only items that can be returned by the secondary filter are those that were allowed to be returned from the initial text filter.

This is very important in the overall consideration as the two query operations ( both "text" and "geoSpatial" ) strive to achieve very different things. In the "text" case it is looking for "top results" for the term given, and is typically used to return only a limited number of results matching the term in ranked order. This means that when applying any other filter condition, there is a strong possibility that many of the items that met that first condition do not meet the additional criteria.

In short, 'Not all things "Amazing" are necessarily anywhere near the queried point', which means that with a realistic limit like 100 results, taken by best match, those 100 likely do not contain all of the "near" items as well.
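A small in-memory simulation can make the difference concrete. Everything here is made up for illustration: the hotel names, the scores and the distances are not produced by MongoDB, and textFirst and locationFirst are hypothetical stand-ins for the two query orders.

```javascript
// Made-up sample data: "score" stands in for a text relevance score,
// "distance" for meters from the queried point.
const sampleHotels = [
  { name: "Amazing Amazing Palace", score: 1.5, distance: 90000 },
  { name: "Amazing Towers", score: 1.0, distance: 70000 },
  { name: "Fairly Amazing Inn", score: 0.75, distance: 1200 },
  { name: "Budget Stay", score: 0, distance: 300 }
];

// Text first: take the top `limit` matches by score, then filter by distance.
function textFirst(hotels, limit, maxDistance) {
  return hotels
    .filter(h => h.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .filter(h => h.distance <= maxDistance);
}

// Location first: keep what is in range, nearest first, then filter by name.
function locationFirst(hotels, maxDistance, pattern) {
  return hotels
    .filter(h => h.distance <= maxDistance)
    .sort((a, b) => a.distance - b.distance)
    .filter(h => pattern.test(h.name));
}

console.log(textFirst(sampleHotels, 2, 5000).map(h => h.name));
// → []
console.log(locationFirst(sampleHotels, 5000, /Amazing/).map(h => h.name));
// → [ 'Fairly Amazing Inn' ]
```

With a limit of two, the nearby match is cut before the distance filter ever sees it, while the location-first order still finds it.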

Also, the $text operator does not actually "sort" the results in any way by itself. Its primary purpose is in fact not only to "match" on a phrase but to "score" the result in order to float the "best" match to the top. This is typically done "after" the query itself, with the projected score being "sorted" and most likely "limited" as mentioned above. It is possible in an aggregation pipeline to do that and then apply the second filter(s), but as stated this likely excludes things that are otherwise "near" for the other purpose.
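One possible shape for such a pipeline is sketched below. The builder name buildTextFirstPipeline and its parameters are hypothetical; the stages themselves ( $match with $text first, $sort on the "textScore" metadata, $limit, then a $geoWithin filter ) are standard aggregation stages.

```javascript
// Hypothetical builder for a "text first, then geo filter" pipeline.
function buildTextFirstPipeline(term, longitude, latitude, meters, limit) {
  return [
    // A $match containing $text has to be the first stage of the pipeline.
    { $match: { $text: { $search: term } } },
    // Order by relevance score, best match first.
    { $sort: { score: { $meta: "textScore" } } },
    // Keep only the top matches; anything cut here never reaches the geo filter.
    { $limit: limit },
    // Secondary filter; $centerSphere takes radians, hence the division
    // by the Earth's approximate equatorial radius in meters.
    {
      $match: {
        location: {
          $geoWithin: { $centerSphere: [[longitude, latitude], meters / 6378100] }
        }
      }
    }
  ];
}

// In the shell this could be run as:
//   db.hotels.aggregate(buildTextFirstPipeline("Amazing", 72.867804, 19.076033, 5000, 100))
console.log(JSON.stringify(buildTextFirstPipeline("Amazing", 72.867804, 19.076033, 5000, 100), null, 2));
```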

The reverse is also likely true ( 'There are many "Amazing" things further away from the point' ), but with realistic distance limits this becomes less likely. But the other consideration given is this is not a true text search, but just using a regular expression to match the given term.

As a final note, I'm always using "Amazing" as the example phrase here and not "Most" as suggested in the question. This is because of how "stop words" are handled in text indexes here ( as well as in most dedicated text search products ), in that the particular term would be ignored, much like "and", "or", "the", and even "in" would be as well, as they are not really considered valuable to a phrase, which is what text search works on.

So it in fact remains that a regular expression would actually be better at matching such terms, if indeed that were required at all.
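If a phrase like "The Most" really has to be matched, a regular expression filter could be built along these lines. This is only a sketch: escapeRegex and nameFilter are hypothetical helpers, and the case-insensitive flag is an assumption about what a user would expect.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so that user input is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a filter on the "name" field that matches
// the phrase anywhere in the name, ignoring case.
function nameFilter(phrase) {
  return { name: new RegExp(escapeRegex(phrase), "i") };
}

// In the shell this could be combined with the location-first query, e.g.
//   db.hotels.find(Object.assign(
//     { "location": { "$nearSphere": { /* ... as shown earlier ... */ } } },
//     nameFilter("The Most")))
const filter = nameFilter("The Most");
console.log(filter.name.test("The Most Amazing Hotel"));
console.log(filter.name.test("The Fantastic GuestHouse"));
```

An unanchored, case-insensitive expression like this generally cannot make effective use of an ordinary index on "name", which matters less here because the geoSpatial clause does the primary selection.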

Concluding

Which really brings us back full circle to the original point, in that a "text" query really does not belong here anyway. The other useful filters usually work better in tandem with the true "geoSpatial" search criteria, and true "text search" is really low on the list of what would be important.

More likely is that people want a location that lies within a "Set Intersection" of distances from desired destinations they wish to visit, or at least near enough to some, or most. Then of course other factors ( "price", "service" etc ) as mentioned earlier are things people want in general consideration.

It's not really a "good fit" to look for the results this way. If you think you really must, then apply one of the "cheat" approaches, or in fact use different queries and then some other logic to merge each set of results. But it really does not make sense for the server to do this alone, which is why it does not try.
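As a sketch of what that "other logic" might look like in application code, the function below intersects the two result sets by _id and orders what is left by distance. The name mergeByDistance is hypothetical, and it assumes the geo results carry a "distance" field such as the one written by $geoNear's distanceField.

```javascript
// Hypothetical helper: keep only hotels present in both result sets and
// order them by the distance reported by the geo query.
function mergeByDistance(textResults, geoResults) {
  // Compare ids as strings so ObjectId values and plain values both work.
  const textIds = new Set(textResults.map(doc => String(doc._id)));
  return geoResults
    .filter(doc => textIds.has(String(doc._id)))
    .sort((a, b) => a.distance - b.distance);
}

// Made-up sample results standing in for the output of the two queries.
const fromText = [{ _id: 1, name: "The Most Amazing Hotel" }, { _id: 3, name: "Amazing Towers" }];
const fromGeo = [
  { _id: 3, name: "Amazing Towers", distance: 4200 },
  { _id: 2, name: "The Fantastic GuestHouse", distance: 150 },
  { _id: 1, name: "The Most Amazing Hotel", distance: 900 }
];
console.log(mergeByDistance(fromText, fromGeo).map(doc => doc.name));
// → [ 'The Most Amazing Hotel', 'Amazing Towers' ]
```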

So I would focus on getting your geoSpatial matches right first, then apply other criteria that should be important to results. But I don't really believe that "text search" is a valid one of them anyway. "Cheat" instead, but only if you really must.
