Show only matching fields for MongoDB text search


Question


I am new to Mongo, and wanted to implement text search functionality for a Web front-end. I have added all text fields in a collection in the "text" index, so search finds a match in all the fields. Documents may be heavy.

The problem is that I receive the whole matching documents and not just the matching fields. I want to get only the matching fields along with the document _id, so I can present just the hints in the Web type-ahead, and when the user selects a match, I can load the whole document by its _id.

There is a $project operator, but the problem is that I don't know in which of the text fields the match will appear.

Solution

After thinking about this for a long time, I think it is possible to implement what you want. However, it is not suitable for very large databases and I haven't worked out an incremental approach yet. It lacks stemming, and stop words have to be defined manually.

The idea is to use mapReduce to create a collection of search words, each with a reference to the document and the field it originated from. The actual autocompletion query is then done with a simple aggregation which utilizes an index and therefore should be rather fast.

So we will work with the following three documents

{
  "name" : "John F. Kennedy",
  "address" : "Kenson Street 1, 12345 Footown, TX, USA",
  "note" : "loves Kendo and Sushi"
}

and

{
  "name" : "Robert F. Kennedy",
  "address" : "High Street 1, 54321 Bartown, FL, USA",
  "note" : "loves Ethel and cigars"
}

and

{
  "name" : "Robert F. Sushi",
  "address" : "Sushi Street 1, 54321 Bartown, FL, USA",
  "note" : "loves Sushi and more Sushi"
}

in a collection called textsearch.

The map/reduce stage

Basically, we process every word in each of the three fields, remove stop words and numbers, and save each remaining word together with the document's _id and the field it occurred in to an intermediate collection.

The annotated code:

db.textsearch.mapReduce(
  function() {

    // We need to save this in a local var as per scoping problems
    var document = this;

    // You need to expand this according to your needs
    var stopwords = ["the","this","and","or"];

    // This denotes the fields which should be processed
    var fields = ["name","address","note"];

    // For each field...
    fields.forEach(

      function(field){

        // ... we split the field into single words...
        var words = (document[field]).split(" ");

        words.forEach(

          function(word){
            // ...and remove unwanted characters.
            // Please note that this regex may well need to be enhanced
            var cleaned = word.replace(/[;,.]/g,"")

            // Next we check...
            if(
              // ...whether the current word is in the stopwords list
              // (checked on the cleaned, lowercased word so that "and," or "The" are caught),...
              (stopwords.indexOf(cleaned.toLowerCase())>-1) ||

              // ...is either a float or an integer... 
              !(isNaN(parseInt(cleaned))) ||
              !(isNaN(parseFloat(cleaned))) ||

              // or is only one character.
              cleaned.length < 2
            )
            {
              // In any of those cases, we do not want to have the current word in our list.
              return
            }
              // Otherwise, we want to have the current word processed.
              // Note that we have to use a multikey id and a static field in order
              // to overcome one of MongoDB's mapReduce limitations:
              // it can not have multiple values assigned to a key.
              emit({'word':cleaned,'doc':document._id,'field':field},1)

          }
        )
      }
    )
  },
  function(key,values) {

    // We sum up each occurrence of each word
    // in each field in every document...
    return Array.sum(values);
  },
    // ..and write the result to a collection
  {out: "searchtst" }
)

Running this will result in the creation of the collection searchtst. If it already existed, all of its contents will be replaced.

It will look something like this:

{ "_id" : { "word" : "Bartown", "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "Bartown", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "Ethel", "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "note" }, "value" : 1 }
{ "_id" : { "word" : "FL", "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "FL", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "Footown", "doc" : ObjectId("544b7e44fd9270c1492f5834"), "field" : "address" }, "value" : 1 }
[...]
{ "_id" : { "word" : "Sushi", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "name" }, "value" : 1 }
{ "_id" : { "word" : "Sushi", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "note" }, "value" : 2 }
[...]

There are a few things to note here. First of all, a word can have multiple occurrences, for example "FL". These occurrences may be in different documents, as is the case here. On the other hand, a word can also occur multiple times in a single field of a single document, in which case the value is greater than 1. We will use this to our advantage later.
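The counting behaviour described above can be tried without a database. The following is only a sketch in plain JavaScript that mimics the map and reduce steps in memory; the short ids "doc1" to "doc3" are placeholders for the real ObjectIds, the helper names keepWord and countWords are made up for illustration, and the stop word check is applied to the cleaned, lowercased word.

```javascript
// Plain JavaScript stand-in for the map and reduce steps; no MongoDB needed.
// The ids "doc1".."doc3" are placeholders for the real ObjectIds.
const docs = [
  { _id: "doc1", name: "John F. Kennedy", address: "Kenson Street 1, 12345 Footown, TX, USA", note: "loves Kendo and Sushi" },
  { _id: "doc2", name: "Robert F. Kennedy", address: "High Street 1, 54321 Bartown, FL, USA", note: "loves Ethel and cigars" },
  { _id: "doc3", name: "Robert F. Sushi", address: "Sushi Street 1, 54321 Bartown, FL, USA", note: "loves Sushi and more Sushi" }
];

const stopwords = ["the", "this", "and", "or"];
const fields = ["name", "address", "note"];

// Same filter rules as the map function: strip punctuation, then drop
// stop words, numbers and single characters. Returns null for rejected words.
function keepWord(word) {
  const cleaned = word.replace(/[;,.]/g, "");
  const rejected =
    stopwords.indexOf(cleaned.toLowerCase()) > -1 ||
    !isNaN(parseInt(cleaned)) ||
    !isNaN(parseFloat(cleaned)) ||
    cleaned.length < 2;
  return rejected ? null : cleaned;
}

// "map" produces one (word, doc, field) key per kept word;
// "reduce" sums the 1s recorded for identical keys.
function countWords(documents) {
  const counts = new Map();
  for (const doc of documents) {
    for (const field of fields) {
      for (const word of String(doc[field] || "").split(" ")) {
        const cleaned = keepWord(word);
        if (cleaned === null) continue;
        const key = JSON.stringify({ word: cleaned, doc: doc._id, field: field });
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

const counts = countWords(docs);
const sushiInNote = counts.get(JSON.stringify({ word: "Sushi", doc: "doc3", field: "note" }));
console.log(sushiInNote); // 2, because "Sushi" appears twice in that note
```

Because the document _id and the field name are part of the key, the same word in another field or another document ends up in a separate entry with its own count.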

Second, all the fields, most notably the word field, are part of the compound _id. Be aware that the default _id index covers the embedded document as a whole, so a query on "_id.word" alone cannot use it; a separate index on "_id.word" is what makes the coming queries fast. Either way, the index will be quite large and, as with all indices, tends to eat up RAM.

The aggregation stage

So we have reduced the list of words. Now we query for a (sub)string. What we need to do is to find all words beginning with the string the user typed in so far, returning a list of words matching that string. In order to be able to do this and to get the results in a form suitable for us, we use an aggregation.

This aggregation should be pretty fast as long as "_id.word" is indexed, since the $match stage is the only part that can make use of an index.
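Since the pattern for the $match stage is built from whatever the user typed, characters with a special meaning in regular expressions should be escaped first. A minimal sketch in plain JavaScript follows; escapeRegex and prefixPattern are hypothetical helper names, not part of MongoDB or any driver.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so that user input such as "St." or "C++" is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build the anchored, case-insensitive pattern that would be passed
// to the $match stage as the value for "_id.word".
function prefixPattern(userInput) {
  return new RegExp("^" + escapeRegex(userInput), "i");
}

const pattern = prefixPattern("s");
console.log(pattern.test("Sushi"));   // true: starts with "s", ignoring case
console.log(pattern.test("Kansas"));  // false: contains "s" but does not start with it
```

The resulting RegExp object can be used in place of the literal /^S/i in the aggregation.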

Here is the annotated aggregation for the case when the user typed in the letter S:

db.searchtst.aggregate([
  // We match case-insensitively ("i") so that capitalization
  // typos do not reduce our search results
  { $match:{"_id.word":/^S/i} },
  { $group:{
      // Here is where the magic happens:
      // we create a list of distinct words...
      _id:"$_id.word",
      occurrences:{
        // ...add each occurrence to an array...
        $push:{
          doc:"$_id.doc",
          field:"$_id.field"
        }
      },
      // ...and add up all occurrences to a score.
      // Note that this is optional and might be skipped
      // to speed up things, as we should have a covered query
      // when not accessing $value, though I am not too sure about that
      score:{$sum:"$value"}
    }
  },
  {
    // Optional. See above
    $sort:{_id:-1,score:1}
  }
])

The result of this query looks something like this and should be pretty self-explanatory:

{
  "_id" : "Sushi",
  "occurrences" : [
    { "doc" : ObjectId("544b7e44fd9270c1492f5834"), "field" : "note" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "name" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "note" }
  ],
  "score" : 5
}
{
  "_id" : "Street",
  "occurrences" : [
    { "doc" : ObjectId("544b7e44fd9270c1492f5834"), "field" : "address" },
    { "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "address" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" }
  ],
  "score" : 3
}

The score of 5 for Sushi, although only four occurrences are listed, comes from the fact that the word Sushi occurs twice in the note field of one of the documents. This is intended behavior.
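To see where the numbers come from, the $match and $group stages can be mimicked in memory. This is a sketch in plain JavaScript, not the aggregation itself; the entries are a hand-copied subset of the intermediate collection, "doc1" to "doc3" stand in for the ObjectIds, and groupByWord is a hypothetical helper name.

```javascript
// Some entries of the intermediate collection, shaped like the mapReduce output.
// "doc1".."doc3" are placeholders for the real ObjectIds.
const entries = [
  { _id: { word: "Sushi", doc: "doc1", field: "note" }, value: 1 },
  { _id: { word: "Sushi", doc: "doc3", field: "address" }, value: 1 },
  { _id: { word: "Sushi", doc: "doc3", field: "name" }, value: 1 },
  { _id: { word: "Sushi", doc: "doc3", field: "note" }, value: 2 },
  { _id: { word: "Street", doc: "doc1", field: "address" }, value: 1 },
  { _id: { word: "Street", doc: "doc2", field: "address" }, value: 1 },
  { _id: { word: "Street", doc: "doc3", field: "address" }, value: 1 },
  { _id: { word: "Bartown", doc: "doc2", field: "address" }, value: 1 }
];

function groupByWord(rows, prefix) {
  const groups = new Map();
  for (const row of rows) {
    const word = row._id.word;
    // Stand-in for the $match stage: case-insensitive prefix test.
    if (!word.toLowerCase().startsWith(prefix.toLowerCase())) continue;
    if (!groups.has(word)) {
      groups.set(word, { _id: word, occurrences: [], score: 0 });
    }
    const group = groups.get(word);
    // Stand-in for $push: one array element per (doc, field) pair.
    group.occurrences.push({ doc: row._id.doc, field: row._id.field });
    // Stand-in for $sum: add up the per-field counts.
    group.score += row.value;
  }
  return Array.from(groups.values());
}

const result = groupByWord(entries, "S");
const sushi = result.find(function (g) { return g._id === "Sushi"; });
console.log(sushi.occurrences.length, sushi.score); // 4 5
```

The array gets one element per (document, field) pair, while the score adds up the per-field counts, which is why the two numbers can differ.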

While this may be a poor man's solution that needs to be optimized for the myriad of conceivable use cases and would need an incremental mapReduce to be halfway useful in production environments, it works as expected. HTH.

Edit

Of course, one could drop the $match stage and add an $out stage in the aggregation phase in order to have the results preprocessed:

db.searchtst.aggregate([
  {
    $group:{
      _id:"$_id.word",
      occurrences:{ $push:{doc:"$_id.doc",field:"$_id.field"}},
      score:{$sum:"$value"}
    }
  },{
    $out:"search"
  }
])

Now, we can query the resulting search collection in order to speed things up. Basically, you trade real-time results for speed.

Edit 2: If the preprocessing approach is taken, the searchtst collection of the example should be deleted after the aggregation has finished, in order to save both disk space and, more importantly, precious RAM.
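On the front-end side, each result document can be flattened into type-ahead hints that carry the document _id, so the full document can be loaded once the user picks one. The following is only a sketch in plain JavaScript; it assumes the array is stored under the key occurrences, uses "doc1" to "doc3" as placeholders for the ObjectIds, and toHints and the hint shape (label, docId) are made up for illustration.

```javascript
// Turn aggregation results into a flat list of type-ahead hints.
// Each hint keeps the document id so the whole document can be
// fetched by _id after the user selects it.
function toHints(results) {
  const hints = [];
  for (const result of results) {
    for (const occurrence of result.occurrences) {
      hints.push({
        label: result._id + " (" + occurrence.field + ")",
        docId: occurrence.doc
      });
    }
  }
  return hints;
}

// Sample input shaped like one result document of the aggregation.
const sampleResults = [
  {
    _id: "Street",
    occurrences: [
      { doc: "doc1", field: "address" },
      { doc: "doc2", field: "address" },
      { doc: "doc3", field: "address" }
    ],
    score: 3
  }
];

const hints = toHints(sampleResults);
console.log(hints[0].label); // Street (address)
```

Only the matching word, the field name and the _id travel to the browser this way, which keeps the type-ahead responses small even when the documents themselves are heavy.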
