Implement auto-complete feature using MongoDB search


Question

I have documents in MongoDB of the form:

{
    "id": 42,
    "title": "candy can",
    "description": "canada candy canteen",
    "brand": "cannister candid",
    "manufacturer": "candle canvas"
}


I need to implement auto-complete feature based on the input search term by matching in the fields except id. For example, if the input term is can, then I should return all matching words in the document as

{ "hints": ["candy", "can", "canada", "canteen", ...] }


I looked at this question but it didn't help. I also tried searching how to do regex search in multiple fields and extract matching tokens, or extracting matching tokens in a MongoDB text search but couldn't find any help.

Answer


tl;dr

There is no easy solution for what you want, since normal queries can't modify the fields they return. There is a solution (using the below mapReduce inline instead of doing an output to a collection), but except for very small databases, it is not possible to do this in realtime.


As written, a normal query can't really modify the fields it returns. But there are other problems. If you want the regex search to return in halfway decent time, you would have to index all the fields, which would need a disproportionate amount of RAM for that feature. If you didn't index all fields, a regex search would cause a collection scan, meaning every document would have to be loaded from disk, which would take too much time for autocompletion to be convenient. Furthermore, multiple simultaneous users requesting autocompletion would put considerable load on the backend.


The problem is quite similar to one I have already answered: we need to extract every word out of multiple fields, remove the stop words, and save the remaining words together with a link to the respective document(s) the word was found in, into a separate collection. To get an autocompletion list, we then simply query that indexed word list.

db.yourCollection.mapReduce(
  // Map function
  function() {

    // We need to save this in a local var as per scoping problems
    var document = this;

    // You need to expand this according to your needs
    var stopwords = ["the","this","and","or"];

    for(var prop in document) {

      // We are only interested in strings and explicitly not in _id
      if(prop === "_id" || typeof document[prop] !== 'string') {
        continue
      }

      (document[prop]).split(" ").forEach(
        function(word){

          // You might want to adjust this to your needs
          var cleaned = word.replace(/[;,.]/g,"")

          if(
            // We neither want stopwords...
            stopwords.indexOf(cleaned) > -1 ||
            // ...nor string which would evaluate to numbers
            !(isNaN(parseInt(cleaned))) ||
            !(isNaN(parseFloat(cleaned)))
          ) {
            return
          }
          emit(cleaned,document._id)
        }
      ) 
    }
  },
  // Reduce function
  function(k,v){

    // Kind of ugly, but works.
    // Improvements more than welcome!
    var values = { 'documents': []};
    v.forEach(
      function(vs){
        if(values.documents.indexOf(vs)>-1){
          return
        }
        values.documents.push(vs)
      }
    )
    return values
  },

  {
    // We need this for two reasons...
    finalize:

      function(key,reducedValue){

        // First, we ensure that each resulting document
        // has the documents field in order to unify access
        var finalValue = {documents:[]}

        // Second, we ensure that each document is unique in said field
        if(reducedValue.documents) {

          // We filter the existing documents array
          finalValue.documents = reducedValue.documents.filter(

            function(item,pos,self){

              // The default return value
              var loc = -1;

              for(var i=0;i<self.length;i++){
                // We have to do it this way since indexOf only works with primitives

                if(self[i].valueOf() === item.valueOf()){
                  // We have found the value of the current item...
                  loc = i;
                  //... so we are done for now
                  break
                }
              }

              // If the location we found equals the position of item, they are equal
              // If it isn't equal, we have a duplicate
              return loc === pos;
            }
          );
        } else {
          finalValue.documents.push(reducedValue)
        }
        // We have sanitized our data, now we can return it        
        return finalValue

      },
    // Our result are written to a collection called "words"
    out: "words"
  }
)

Running this mapReduce against your example results in db.words looking like this:

    { "_id" : "can", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "canada", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "candid", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "candle", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "candy", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "cannister", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
    { "_id" : "canvas", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
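The filtering the map function applies (dropping stopwords and numeric-looking tokens) can be tried out in plain JavaScript outside the shell. This is a minimal sketch, not the exact map function; the stopword list and punctuation rule are the same placeholder ones from the answer above and should be adapted to your data:

```javascript
// Stand-alone sketch of the map function's tokenization, runnable in Node.js.
const stopwords = ["the", "this", "and", "or"];

function extractWords(doc) {
  const words = [];
  for (const prop in doc) {
    // Only string fields, and explicitly not _id
    if (prop === "_id" || typeof doc[prop] !== "string") continue;
    for (const word of doc[prop].split(" ")) {
      // Strip basic punctuation; adjust to your needs
      const cleaned = word.replace(/[;,.]/g, "");
      // Skip stopwords and tokens that parse as numbers
      if (stopwords.indexOf(cleaned) > -1 || !isNaN(parseFloat(cleaned))) continue;
      words.push(cleaned);
    }
  }
  return words;
}

const doc = {
  _id: 42,
  title: "candy can",
  description: "canada candy canteen",
  brand: "cannister candid",
  manufacturer: "candle canvas",
};
console.log(extractWords(doc));
// -> [ 'candy', 'can', 'canada', 'candy', 'canteen',
//      'cannister', 'candid', 'candle', 'canvas' ]
```

Note that duplicates within one document still come out of this step; in the mapReduce they are collapsed later by the reduce/finalize stage.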


Note that the individual words are the _id of the documents. The _id field is indexed automatically by MongoDB. Since MongoDB tries to keep indices in RAM, we can do a few tricks to both speed up autocompletion and reduce the load put on the server.


For autocompletion, we only need the words, without the links to the documents. Since the words are indexed, we use a covered query – a query answered only from the index, which usually resides in RAM.


To stick with your example, we would use the following query to get the candidates for autocompletion:

db.words.find({_id:/^can/},{_id:1})


which gives us the result

    { "_id" : "can" }
    { "_id" : "canada" }
    { "_id" : "candid" }
    { "_id" : "candle" }
    { "_id" : "candy" }
    { "_id" : "cannister" }
    { "_id" : "canteen" }
    { "_id" : "canvas" }
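One caveat when building that prefix regex from live user input: the input should be escaped first, so characters like `.` or `*` are treated literally rather than as regex metacharacters. A minimal helper (this escape function is an addition of mine, not part of the answer above):

```javascript
// Escape regex metacharacters in the user's input, then build the
// anchored prefix pattern to pass to db.words.find({_id: pattern}).
function prefixPattern(term) {
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped);
}

console.log(prefixPattern("can").test("canteen")); // true
console.log(prefixPattern("c.n").test("can"));     // false: the dot stays literal
```

Keeping the pattern anchored with `^` also matters for performance: only a prefix-anchored regex lets MongoDB bound the index scan instead of scanning every key.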


Using the .explain() method, we can verify that this query uses only the index.

    {
        "cursor" : "BtreeCursor _id_",
        "isMultiKey" : false,
        "n" : 8,
        "nscannedObjects" : 0,
        "nscanned" : 8,
        "nscannedObjectsAllPlans" : 0,
        "nscannedAllPlans" : 8,
        "scanAndOrder" : false,
        "indexOnly" : true,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 0,
        "indexBounds" : {
            "_id" : [
                [
                    "can",
                    "cao"
                ],
                [
                    /^can/,
                    /^can/
                ]
            ]
        },
        "server" : "32a63f87666f:27017",
        "filterSet" : false
    }

Note the indexOnly: true field.


Although we have to do two queries to get the actual document, since we speed up the overall process, the user experience should still be good enough.


Step 3.1: Query the word document

When the user selects one of the autocompletion choices, we query the words collection for the complete document, in order to find the documents where the chosen word originated from.

db.words.find({_id:"canteen"})


which would result in a document like this:

{ "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }



Step 3.2: Get the actual document

With that document, we can now either show a page with search results or, as in this case, redirect to the actual document, which you can get by:

db.yourCollection.find({_id:ObjectId("553e435f20e6afc4b8aa0efb")})
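The overall two-step lookup (word, then document ids, then documents) can be sketched with in-memory maps standing in for the `words` and source collections; real code would run the two `find` queries above through the MongoDB driver, and all names here are illustrative:

```javascript
// In-memory sketch of the two-step lookup: the "words" collection maps
// each word to the ids of the documents it was extracted from.
const wordsCollection = new Map([
  ["canteen", { documents: [42] }],
]);
const sourceCollection = new Map([
  [42, { _id: 42, title: "candy can", description: "canada candy canteen" }],
]);

function findDocumentsForWord(word) {
  // Step 3.1: look up the word document
  const entry = wordsCollection.get(word);
  if (!entry) return [];
  // Step 3.2: fetch the actual documents by their ids
  return entry.documents.map((id) => sourceCollection.get(id));
}

console.log(findDocumentsForWord("canteen").map((d) => d.title)); // [ 'candy can' ]
```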



Notes

While this approach may seem complicated at first (well, the mapReduce is a bit), it is actually pretty easy conceptually. Basically, you are trading real-time results (which you won't have anyway unless you spend a lot of RAM) for speed. Imho, that's a good deal. To make the rather costly mapReduce phase more efficient, implementing incremental mapReduce could be one approach; improving my admittedly hacked mapReduce might well be another.


Last but not least, this approach is a rather ugly hack altogether. You might want to dig into Elasticsearch or Lucene; those products are imho much, much better suited for what you want.
