Using full text search with geospatial index on Mongodb


Problem description

Let's say I want to develop an Android app that allows a user to search for the hotel closest to where they are located. This is very common in apps nowadays, like Airbnb for example.

This is the dataset I'm using:

{
    "name" : "The Most Amazing Hotel",
    "city" : "India",
    "type": "Point",
    "coord": [
        -56.16082,
        61.15392
      ]
}

{
    "name" : "The Most Incredible Hotel",
    "city" : "India",
    "type": "Point",
    "coord": [
        -56.56285,
        61.34590
      ]
}

{
    "name" : "The Fantastic GuestHouse",
    "city" : "India",
    "type": "Point",
    "coord": [
        -56.47085,
        61.11357
      ]
}

Now, I want to create a text index on the name field so that it searches by name and then sorts by a geospatial index based on the coordinates.

So if I search for the words "The Most", it will search by name for the words "The Most" and return the closest hotels with the words "The Most" in them.

Does MongoDB even support this type of search?

I'm reading the guidance for MongoDB here: https://docs.mongodb.org/manual/core/index-text/

A compound text index cannot include any other special index types, such as multi-key or geospatial index fields.

As far as I understand, I'm not creating a compound text index. This is a simple text index which means I'm only indexing the text for the name field and not for the city AND name fields.

Solution

There is a fair case that you really do not need this at all, as it is very hard to justify a use case for such an operation, and I would argue that "Searching for a Hotel" is not something where a combination of "text" and "geoSpatial" search really applies.

In reality "most people" would be looking for something close to a location, or even more likely close to various locations they want to visit, as part of their primary criteria, and then the other "winners" would likely be weighted more towards "cost", "rating", "brand", "facilities", and likely even proximity to eateries etc.

Adding "Text search" to that list is a very different thing and likely not of much real use in this particular application.

Still, this probably deserves some explanation, and there are a few concepts to understand here as to why the two concepts don't really "mesh" for this use case at least.

Fixing Schema

Firstly, I'd like to make a suggestion to "tweak" your data schema a little:

{
    "name" : "The Most Amazing Hotel",
    "city" : "India",
    "location": {
        "type": "Point",
        "coordinates": [
               72.867804,
               19.076033
        ]
    }
}

That at least provides "location" as a valid GeoJSON object for indexing, and you generally want GeoJSON rather than legacy coordinate pairs, as it opens up more options for query and storage in general, plus distances are standardized to meters rather than the equivalent "radians" around the globe.
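As a rough sketch of that reshaping, the plain JavaScript below converts one of the flat documents from the question into the suggested layout. The helper name toGeoJsonDoc is hypothetical, and it assumes the existing "coord" array is already ordered [longitude, latitude], which is the order GeoJSON expects; if your data is stored the other way round, swap the two values first.

```javascript
// Hypothetical helper (not part of MongoDB): reshape a flat document with
// top-level "type" and "coord" fields into one with a GeoJSON "location".
// Assumption: coord is already [longitude, latitude].
function toGeoJsonDoc(doc) {
  // Drop the flat "type" and "coord" fields, keep everything else.
  const { type, coord, ...rest } = doc;
  const [lng, lat] = coord;
  // Valid GeoJSON points need longitude in [-180, 180] and latitude in [-90, 90].
  if (lng < -180 || lng > 180 || lat < -90 || lat > 90) {
    throw new RangeError(`coordinates out of range: [${lng}, ${lat}]`);
  }
  return { ...rest, location: { type: "Point", coordinates: [lng, lat] } };
}

const flatDoc = {
  name: "The Most Amazing Hotel",
  city: "India",
  type: "Point",
  coord: [72.867804, 19.076033],
};

console.log(JSON.stringify(toGeoJsonDoc(flatDoc), null, 2));
```

The range check is only a cheap guard against swapped or corrupt values; it cannot tell a swapped pair apart from a valid one when both numbers happen to fit either range.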

Why they don't work together

So your reading is basically correct in that you cannot use more than one special index at once. First look at the compound index definition:

db.hotels.createIndex({ "name": "text", "location": "2dsphere" })

{ "ok" : 0, "errmsg" : "bad index key pattern { name: "text", location: "2dsphere" }: Can't use more than one index plugin for a single index.", "code" : 67 }

So that cannot be done. Even considering them separately:

db.hotels.createIndex({ "name": "text" })
db.hotels.createIndex({ "location": "2dsphere" })

Then try doing a query:

db.hotels.find({
    "location": {
        "$nearSphere": {
            "$geometry": {
                "type": "Point",
                "coordinates": [
                   72.867804,
                   19.076033
                ]
            }
        }
    },
    "$text": { "$search": "Amazing" }
})

Error: command failed: { "waitedMS" : NumberLong(0), "ok" : 0, "errmsg" : "text and geoNear not allowed in same query", "code" : 2 } : undefined

Which actually backs up the reasons why this could not be defined in a compound index in three ways:

  1. As the initial error indicates, the way these "special" indexes are handled in MongoDB requires essentially "branching off" to the "special" handler for the selected index type, and the two handlers do not live in the same place.

  2. Even with separate indexes, since the logic is basically an "and" condition, MongoDB cannot actually select more than one index anyway, and since both query clauses require "special" handling it would in fact be required to do so. And it cannot.

  3. Even if this were logically an $or condition, you basically end up back at point 1, where even applying "index intersection" there is another property of such "special" indexes: they must be applied at the "top level" of the query operations in order to allow index selection. Wrapping these in an $or means MongoDB cannot do that and therefore it is not allowed.

But you can "Cheat"

So each basically has to be exclusive, and you cannot use them together. But of course you can always "cheat", depending on which order of search is more important to you.

Either by "location" first:

db.hotels.aggregate([
    { "$geoNear": {
        "near": {
            "type": "Point",
            "coordinates": [
               72.867804,
               19.076033
            ]
        },
        "spherical": true,
        "maxDistance": 5000,
        "distanceField": "distance",
        "query": {
           "name": /Amazing/
        }
    }}
])

Or even:

db.hotels.find({
    "location": {
        "$nearSphere": {
            "$geometry": {
                "type": "Point",
                "coordinates": [
                   72.867804,
                   19.076033
                ]
            },
            "$maxDistance": 5000
        }
    },
    "name": /Amazing/
})

Or by text search first:

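db.hotels.find({
    "$text": { "$search": "Amazing" },
    "location": {
        "$geoWithin": {
            // $centerSphere takes its radius in radians, not meters:
            // 5 km divided by the Earth's equatorial radius of 6378.1 km
            "$centerSphere": [[
               72.867804,
               19.076033
            ], 5 / 6378.1 ]
        }
    }
})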

Now you can take a close look at the selection options in each approach with .explain() to see what is happening, but the basic case is that each selects only one of the special indexes to use respectively.

In the first case (location first, in either form) it will be the geoSpatial index on the collection that is used for the primary selection, which finds results based on their proximity to the given location first and then filters by the Regular Expression argument given for the name field.

In the second case (text first) it will use the "text" index to do the primary selection ( therefore finding things "Amazing" first ) and from those results apply a geoSpatial filter ( not using an index ) with $geoWithin, which in this case performs what is basically the equivalent of what a $near is doing, by searching within a circle around a point within the supplied distance to filter results there.
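One detail worth spelling out for the text-first form: $centerSphere takes its radius in radians rather than meters, so a distance in meters has to be divided by the Earth's radius before it goes into the query. A minimal sketch of that conversion follows; the function names are hypothetical, and the 6378.1 km equatorial radius is the figure MongoDB's documentation uses for this calculation.

```javascript
// Earth's equatorial radius in meters (6378.1 km).
const EARTH_RADIUS_METERS = 6378100;

// Convert a distance on the Earth's surface into the radians $centerSphere expects.
function metersToRadians(meters) {
  return meters / EARTH_RADIUS_METERS;
}

// Hypothetical helper: build the text-first filter document as a plain object.
function textThenGeoFilter(term, lng, lat, maxMeters) {
  return {
    $text: { $search: term },
    location: {
      $geoWithin: {
        $centerSphere: [[lng, lat], metersToRadians(maxMeters)],
      },
    },
  };
}

const textFirstFilter = textThenGeoFilter("Amazing", 72.867804, 19.076033, 5000);
console.log(JSON.stringify(textFirstFilter, null, 2));
// In the mongo shell the object would be passed on as db.hotels.find(textFirstFilter).
```

Building the filter as a plain object keeps the unit conversion in one place, so the same 5000 meter limit can be shared with the $nearSphere and $geoNear forms, which do take meters for GeoJSON points.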

Not "all" Queries are Equal

The key thing to consider though is that it is very possible for each approach to return different results. By narrowing down on location first, the only data that can be inspected are those locations within the specified distance, so anything that is "Amazing" outside of the distance would never be considered by the additional filter.

In the second case, since the text term is the primary search, then all results of "Amazing" are put into consideration, and the only items that can be returned by the secondary filter are those that were allowed to be returned from the initial text filter.

This is very important in the overall consideration as the two query operations ( both "text" and "geoSpatial" ) strive to achieve very different things. In the "text" case it is looking for "top results" to the term given, and will by nature only return a limited number of results matching the term in ranked order. This means that when applying any other filter condition, there is a strong possibility that many of the items that met that first condition do not meet the additional criteria.

In short, 'Not all things "Amazing" are necessarily anywhere near the queried point', which means with a realistic limit like 100 results, and by best match, those 100 likely do not contain all of the "near" items as well.
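The difference is easy to see on a toy data set. The sketch below runs entirely in memory with no database involved; the hotel names, distances and text scores are invented purely for illustration, and the two functions only imitate the order of operations of each approach.

```javascript
// Invented sample data: distanceMeters is the distance from the query point,
// textScore stands in for the relevance score a text search would assign.
const sampleHotels = [
  { name: "Amazing Palace", distanceMeters: 90000, textScore: 1.5 },
  { name: "Amazing Resort", distanceMeters: 42000, textScore: 1.0 },
  { name: "Amazing Inn", distanceMeters: 1200, textScore: 0.6 },
  { name: "Plain Hotel", distanceMeters: 300, textScore: 0 },
];

// Location first: restrict by distance, order by proximity, then filter on the name.
function locationFirst(docs, maxMeters, pattern) {
  return docs
    .filter((d) => d.distanceMeters <= maxMeters)
    .sort((a, b) => a.distanceMeters - b.distanceMeters)
    .filter((d) => pattern.test(d.name))
    .map((d) => d.name);
}

// Text first: take the best-scoring matches up to a limit, then filter by distance.
function textFirst(docs, limit, maxMeters) {
  return docs
    .filter((d) => d.textScore > 0)
    .sort((a, b) => b.textScore - a.textScore)
    .slice(0, limit)
    .filter((d) => d.distanceMeters <= maxMeters)
    .map((d) => d.name);
}

console.log(locationFirst(sampleHotels, 5000, /Amazing/)); // [ 'Amazing Inn' ]
console.log(textFirst(sampleHotels, 2, 5000)); // []
```

With a limit of two, the text-first version never gets as far as "Amazing Inn", because the two best-scoring matches are both far away and use up the limit before the distance filter is applied.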

Also, the $text operator does not actually "sort" the results in any way by itself. Its primary purpose is in fact not only to "match" on a phrase but to "score" the result in order to float the "best" match to the top. This is typically done "after" the query itself with the projected value being "sorted" and most likely "limited" as mentioned above. It is possible in aggregation pipelines to do that and then apply the second filter(s), but as stated this likely excludes things that are otherwise "near" in the other purpose.
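For reference, such a pipeline could look roughly like the sketch below, built here as a plain array so its shape is easy to inspect. The function name and its parameters are hypothetical; the stages themselves ( $match with $text first, $sort on the "textScore" metadata, $limit, then a $geoWithin filter ) are standard aggregation stages.

```javascript
// Earth's equatorial radius in meters, used to turn meters into radians.
const EQUATORIAL_RADIUS_M = 6378100;

// Hypothetical builder for a "score, sort, limit, then filter by area" pipeline.
function scoredTextThenAreaPipeline(term, lng, lat, maxMeters, limit) {
  return [
    // $text is only allowed in a $match that is the first stage of the pipeline.
    { $match: { $text: { $search: term } } },
    // Order by relevance, best match first.
    { $sort: { score: { $meta: "textScore" } } },
    // Keep only the top matches.
    { $limit: limit },
    // Secondary geospatial filter; $centerSphere takes its radius in radians.
    {
      $match: {
        location: {
          $geoWithin: {
            $centerSphere: [[lng, lat], maxMeters / EQUATORIAL_RADIUS_M],
          },
        },
      },
    },
  ];
}

const scoredPipeline = scoredTextThenAreaPipeline("Amazing", 72.867804, 19.076033, 5000, 100);
console.log(JSON.stringify(scoredPipeline, null, 2));
// In the mongo shell this would be run as db.hotels.aggregate(scoredPipeline).
```

Because the $limit sits before the geospatial $match, anything nearby that did not score within the top 100 is already gone by the time the distance filter runs.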

The reverse is also likely true ( 'There are many "Amazing" things further away from the point' ), but with realistic distance limits this becomes less likely. The other consideration there is that the location-first form is not a true text search, but just uses a regular expression to match the given term.

As a final note, I'm always using "Amazing" as the example phrase here and not "Most" as suggested in the question. This is because of how stop words are handled alongside "stemming" in text indexes here ( as well as in most dedicated text search products ): that particular term would be ignored, much like "and", "or", "the", and even "in" would be, as such words are not really considered valuable to a phrase, which is what text search works with.

So it in fact remains that a regular expression would actually be better at matching such terms, if indeed that were required at all.
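If matching a phrase such as "The Most" really is required, a sketch of building that regular expression safely from user input might look like this. The helper names are hypothetical, and the case-insensitive "i" flag is a choice made for this example rather than something the queries above use.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so that user input is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a filter on "name" for a literal, case-insensitive match.
function nameContains(term) {
  return { name: new RegExp(escapeRegExp(term), "i") };
}

const phraseFilter = nameContains("The Most");
console.log(phraseFilter.name.test("The Most Amazing Hotel")); // true
console.log(phraseFilter.name.test("The Fantastic GuestHouse")); // false
// The filter can be used as the "query" option of $geoNear, or alongside
// $nearSphere in find(), in place of the literal /Amazing/ shown earlier.
```

Keep in mind that an unanchored regular expression like this cannot make efficient use of an ordinary index on "name", which is another reason to let the geospatial condition do the primary selection.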

Concluding

Which really brings us back full circle to the original point, in that a "text" query really does not belong here anyway. The other useful filters usually work far better in tandem with the true "geoSpatial" search criteria, and true "text search" is really low on the list of what would be important.

More likely is that people want a location that lies within a "Set Intersection" of distances from the destinations they wish to visit, or at least near enough to some, or most. Then of course other factors ( "price", "service" etc ) as mentioned earlier are things people want in general consideration.

It's not really a "good fit" to look for the results this way. If you think you really must, then apply one of the "cheat" approaches, or in fact use different queries and then some other logic to merge each set of results. But it really does not make sense for the server to do this alone, which is why it does not try.
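As one possible shape for that "other logic", the sketch below merges the results of two separate queries on the client: it keeps only the documents that appear in both result sets and preserves the nearest-first order of the geospatial results. The function name and the sample arrays are made up for illustration; in practice the two arrays would come from a $nearSphere query and a $text query run independently.

```javascript
// Hypothetical merge step: intersect two result sets by _id, keeping the
// order of the geospatial results (nearest first).
function mergeNearWithText(nearResults, textResults) {
  // Compare ids as strings so that ObjectId-like values and plain values both work.
  const textIds = new Set(textResults.map((doc) => String(doc._id)));
  return nearResults.filter((doc) => textIds.has(String(doc._id)));
}

// Invented stand-ins for the two query results.
const nearestFirst = [
  { _id: 3, name: "The Fantastic GuestHouse" },
  { _id: 1, name: "The Most Amazing Hotel" },
  { _id: 2, name: "The Most Incredible Hotel" },
];
const textMatches = [
  { _id: 2, name: "The Most Incredible Hotel" },
  { _id: 1, name: "The Most Amazing Hotel" },
];

console.log(mergeNearWithText(nearestFirst, textMatches).map((doc) => doc.name));
// [ 'The Most Amazing Hotel', 'The Most Incredible Hotel' ]
```

This only works well when both result sets are small enough to hold in memory, and any limit applied to either query still affects what can end up in the merged list.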

So I would focus on getting your geoSpatial matches right first, then apply other criteria that should be important to results. But I don't really believe that "text search" is a valid candidate to be one of them anyway. "Cheat" instead, but only if you really must.
