Python fuzzy text search


Question

I am wondering whether there is a Python library that can do fuzzy text search. For example:

  • I have three keywords: "letter", "stamp", and "mail".
  • I would like a function that checks whether those three words occur within the same paragraph (or within a certain distance, such as one page).
  • In addition, the words have to keep the same order. It is fine if other words appear between them.

I have tried fuzzywuzzy, which did not solve my problem. Another library, Whoosh, looks powerful, but I did not find the right function...

Solution

{1} You can do this in Whoosh 2.7. It supports fuzzy search through the plugin whoosh.qparser.FuzzyTermPlugin:

whoosh.qparser.FuzzyTermPlugin lets you search for "fuzzy" terms, that is, terms that don’t have to match exactly. The fuzzy term will match any similar term within a certain number of "edits" (character insertions, deletions, and/or transpositions – this is called the "Damerau-Levenshtein edit distance").
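
To make the notion of "edits" concrete, here is a minimal pure-Python sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment). It is only an illustration and not Whoosh's own implementation; the function name osa_distance is made up for this example, and it counts substitutions as well, as the standard definition does.

```python
def osa_distance(a, b):
    """Count the insertions, deletions, substitutions, and swaps of two
    adjacent characters needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("mail", "mial"))       # one transposition
print(osa_distance("letter", "letters"))  # one insertion
```

With a maximum edit distance of 1, both "mial" and "letters" would therefore count as fuzzy matches for "mail" and "letter".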

To add the fuzzy plugin:

from whoosh import qparser
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.FuzzyTermPlugin())

Once you add the fuzzy plugin to the parser, you can specify a fuzzy term by adding a ~ followed by an optional maximum edit distance. If you don’t specify an edit distance, the default is 1.

For example, the following are all fuzzy term queries (the optional number after the slash is the length of the prefix that must match exactly):

letter~
letter~2
letter~2/3
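
To spell out how these three forms differ, the sketch below splits a fuzzy term into its text, maximum edit distance, and exact-prefix length. It is a stand-alone illustration, not Whoosh's parser: parse_fuzzy_term and its regular expression are hypothetical, and the prefix default of 0 is my assumption about the plugin's behaviour.

```python
import re

# text, then "~", then an optional max edit distance, then an optional "/prefix"
_FUZZY_TERM = re.compile(r"^(?P<text>\w+)~(?P<maxdist>\d+)?(?:/(?P<prefix>\d+))?$")


def parse_fuzzy_term(token, default_maxdist=1, default_prefix=0):
    """Return (text, maxdist, prefixlength) for a token such as 'letter~2/3'."""
    m = _FUZZY_TERM.match(token)
    if m is None:
        raise ValueError("not a fuzzy term: %r" % token)
    maxdist = int(m.group("maxdist")) if m.group("maxdist") else default_maxdist
    prefix = int(m.group("prefix")) if m.group("prefix") else default_prefix
    return m.group("text"), maxdist, prefix


for token in ("letter~", "letter~2", "letter~2/3"):
    print(token, "->", parse_fuzzy_term(token))
```

Read this way, letter~2/3 asks for terms within two edits of "letter" whose first three characters are exactly "let".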


{2} To keep the words in order, use a whoosh.query.Phrase query, but replace the default phrase plugin with whoosh.qparser.SequencePlugin, which lets you use fuzzy terms inside a phrase:

"letter~ stamp~ mail~"

To replace the default phrase plugin with the sequence plugin:

parser = qparser.QueryParser("fieldname", my_index.schema)
parser.remove_plugin_class(qparser.PhrasePlugin)
parser.add_plugin(qparser.SequencePlugin())


{3} To allow other words in between, set the slop argument of your Phrase query to a larger number:

whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)

slop – the number of words allowed between each "word" in the phrase; the default of 1 means the phrase must match exactly.

You can also set the slop in the query string like this:

"letter~ stamp~ mail~"~10
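
As a rough picture of what slop means, the following pure-Python sketch checks whether a list of words occurs in order in a token list, with each word at most slop positions after the previous one. It reflects my reading of the description above and is not Whoosh's matching code; in_order_within and its matches parameter are hypothetical names.

```python
def _exact(token, word):
    return token == word


def in_order_within(tokens, words, slop=1, matches=_exact):
    """True if every word matches some token, in order, with each match
    at most `slop` positions after the previous match."""

    def search(word_index, prev_pos):
        if word_index == len(words):
            return True  # every word has been placed
        if prev_pos is None:
            candidates = range(len(tokens))  # first word may start anywhere
        else:
            candidates = range(prev_pos + 1, min(prev_pos + slop + 1, len(tokens)))
        for pos in candidates:
            if matches(tokens[pos], words[word_index]) and search(word_index + 1, pos):
                return True
        return False

    return search(0, None)


tokens = "letter first stamp second mail third".split()
print(in_order_within(tokens, ["letter", "stamp", "mail"], slop=1))
print(in_order_within(tokens, ["letter", "stamp", "mail"], slop=2))
```

With slop=1 the words would have to be adjacent, so the check fails; with slop=2 one extra word is tolerated between them and it succeeds. Passing a fuzzy comparison as matches would approximate combining fuzzy terms with slop.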


{4} Overall solution:

{4.a} The indexer would look like this:

import os

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT

# create_in() expects the index directory to exist already
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

schema = Schema(title=TEXT(stored=True), content=TEXT)
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title=u"First document", content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
writer.add_document(title=u"Fifth document", content=u"letter first,  mail third")
writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
writer.commit()

{4.b} The searcher would look like this:

from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin

with ix.searcher() as searcher:
    parser = QueryParser(u"content", ix.schema)
    parser.add_plugin(FuzzyTermPlugin())
    # swap the default phrase plugin for one that accepts fuzzy terms
    parser.remove_plugin_class(PhrasePlugin)
    parser.add_plugin(SequencePlugin())
    # each term may differ by up to 2 edits; the phrase slop is 10
    query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
    results = searcher.search(query)
    # single-argument print() calls run under both Python 2 and Python 3
    print("nb of results = %d" % len(results))
    for r in results:
        print(r)

That gives the following result (as printed under Python 2; Python 3 omits the u prefix):

nb of results = 2
<Hit {'title': u'Sixth document'}>
<Hit {'title': u'Third document'}>


{5} If you want fuzzy search to be the default, without using the word~n syntax on each word of the query, you can initialize QueryParser like this:

from whoosh.query import FuzzyTerm
parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)

Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default edit distance of maxdist=1. Subclass it if you want a larger edit distance:

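class MyFuzzyTerm(FuzzyTerm):
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)
        # In Python 3 this can be shortened to super().__init__(...) with the same arguments

# pass the subclass to the parser in place of FuzzyTerm
parser = QueryParser(u"content", ix.schema, termclass=MyFuzzyTerm)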


