Stemming does not work properly for MongoDB text index


Question


I am trying to use the full text search feature of MongoDB and am observing some unexpected behavior. The problem is related to the "stemming" aspect of the text indexing feature. The way full text search is described in many articles online, if you have a string "big hunting dogs" in a document field that is part of the text index, you should be able to search on "hunt" or "hunting" as well as on "dog" or "dogs". MongoDB should normalize or stem the text when indexing and also when searching. So in my example, I would expect it to save the words "dog" and "hunt" in the index and to search for a stemmed version of these words. If I search for "hunting", MongoDB should search for "hunt".

Well, this is not how it works for me. I am running MongoDB 2.4.8 on Linux with full text search enabled. If my record has the value "big hunting dogs", only searching for "big" produces a result, while searches for "hunt" or "dog" produce nothing. It is as if the words that are not in their "normalized" form are not stored in the text index (or are stored in a way that it cannot find them). Searches using the $regex operator work fine, that is, I am able to find the document by searching on a string like /hunting/ against the field in question.

I tried dropping and recreating the full text index, and nothing changed. I can only find documents by the words that are already in their "normal" form. Searching for words like "dogs" or "hunting" (or even "dog" or "hunt") produces no results.

Do I misunderstand or misuse the full text search operations, or is there a bug in MongoDB?
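
To make the expected behavior concrete, here is a toy sketch of stemming applied both at index time and at query time. The names `toyStem`, `buildIndex` and `search` are hypothetical, and the suffix rules are a deliberately crude stand-in, not the stemmer MongoDB actually uses.

```javascript
// Toy stemmer: strips a couple of English suffixes.
// This is an illustration only, NOT MongoDB's stemming algorithm.
function toyStem(word) {
  var w = word.toLowerCase();
  if (w.length > 5 && w.slice(-3) === "ing") return w.slice(0, -3);
  if (w.length > 3 && w.slice(-1) === "s") return w.slice(0, -1);
  return w;
}

// Index time: store each document id under the stem of every word.
function buildIndex(docs) {
  var index = Object.create(null);
  docs.forEach(function (doc) {
    doc.text.split(/\s+/).forEach(function (word) {
      var stem = toyStem(word);
      if (!index[stem]) index[stem] = [];
      if (index[stem].indexOf(doc._id) === -1) index[stem].push(doc._id);
    });
  });
  return index;
}

// Query time: stem the search term the same way before the lookup.
function search(index, term) {
  return index[toyStem(term)] || [];
}

var index = buildIndex([{ _id: 1, text: "big hunting dogs" }]);
console.log(search(index, "hunt"), search(index, "dogs"));
```

Because the same normalization runs on both sides, "hunt", "hunting", "dog" and "dogs" all resolve to the same document in this sketch, which is the behavior the question expects from the text index.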

Answer

Michael,

The "language" field (if present) allows each document to override the
language in which the stemming of words is done. I think that, since
you gave MongoDB a language it did not recognize ("ENG"),
it was unable to stem the words at all. As others pointed out, you can use the
language_override option to specify that MongoDB should use some
other field for this purpose (say "lang") instead of the default one ("language").
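
One way to check whether this applies to your data is to look for documents whose "language" field holds something other than a stemmer language name. The sketch below works on plain objects; `findUnrecognizedLanguages` is a hypothetical helper, and the `known` list is an assumption that contains only the languages named in this thread, so it is deliberately incomplete (see the online documentation for the full list).

```javascript
// Hypothetical helper: return the documents whose language field is present
// but not in the caller-supplied list of known stemmer languages.
function findUnrecognizedLanguages(docs, knownLanguages, field) {
  field = field || "language";
  return docs.filter(function (doc) {
    return Object.prototype.hasOwnProperty.call(doc, field) &&
      knownLanguages.indexOf(doc[field]) === -1;
  });
}

// Assumption: an incomplete example list, not the full set MongoDB supports.
var known = ["english", "french", "swedish"];

var docs = [
  { _id: 1, text: "big hunting dogs", language: "ENG" },
  { _id: 2, text: "Bork de bork", language: "swedish" },
  { _id: 3, text: "no language field here" }
];

var suspects = findUnrecognizedLanguages(docs, known);
console.log(suspects.map(function (d) { return d._id; }));
```

In the shell, the input array could come from something like `db.collection.find().toArray()` on a small collection.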

Below is a nice quote (about full text indexing and searching) that
bears directly on your issue. It is taken from this book:

"MongoDB: The Definitive Guide, 2nd Edition"

Searching in Other Languages

When a document is inserted (or the index is first created), MongoDB looks at the indexed fields and stems each word, reducing it to an essential unit. However, different languages stem words in different ways, so you must specify what language the index or document is. Thus, text-type indexes allow a "default_language" option to be specified, which defaults to "english" but can be set to a number of other languages (see the online documentation for an up-to-date list). For example, to create a French-language index, we could say:

> db.users.ensureIndex({"profil" : "text", "interets" : "text"}, {"default_language" : "french"})

Then French would be used for stemming, unless otherwise specified. You can, on a per-document basis, specify another stemming language by having a "language" field that describes the document’s language:

> db.users.insert({"username" : "swedishChef", "profile" : "Bork de bork", "language" : "swedish"})

What the book does not mention (at least this page of it doesn't) is that
one can use the language_override option to specify that MongoDB
should be using some other field for this purpose (say "lang") and
not the default one ("language").
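
A minimal sketch of that option, assuming a collection named `users` with a text index on `profile`. The index keys and options are plain objects; `createTextIndex` is a hypothetical wrapper around the same `ensureIndex` call used in the quote above, and "lang" is just the example field name from this answer.

```javascript
// Text index keys and options. With language_override set to "lang",
// the per-document stemming language is read from "lang", so a field
// named "language" can hold ordinary data.
var keys = { profile: "text" };
var options = { default_language: "english", language_override: "lang" };

// Hypothetical wrapper; `collection` is assumed to be a shell collection
// such as db.users.
function createTextIndex(collection) {
  return collection.ensureIndex(keys, options);
}

// Example document for such an index: "lang" selects the stemmer,
// while "language" is left alone.
var exampleDoc = {
  username: "swedishChef",
  profile: "Bork de bork",
  lang: "swedish",
  language: "ENG"
};
```

In the shell this would be used as `createTextIndex(db.users)` followed by `db.users.insert(exampleDoc)`.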
