python fuzzy text search

Views: 438. Published: 2017/8/7 0:46:12. Tags: python elasticsearch full-text-search fuzzy-search whoosh


Problem description

I am wondering whether there is a Python library that can do fuzzy text search. For example:


I have three keywords: "letter", "stamp", and "mail".
I would like to have a function to check whether those three words occur within the same paragraph (or within a certain distance, e.g. one page).
In addition, those words have to keep the same order. It is fine if other words appear between those three words.


I have tried fuzzywuzzy, which did not solve my problem. Another library, Whoosh, looks powerful, but I did not find the proper function...
Solution
{1}
You can do this in Whoosh 2.7. It supports fuzzy search through the plugin whoosh.qparser.FuzzyTermPlugin:

  whoosh.qparser.FuzzyTermPlugin lets you search for "fuzzy" terms, that is, terms that don’t have to match exactly. The fuzzy term will match any similar term within a certain number of "edits" (character insertions, deletions, and/or transpositions – this is called the "Damerau-Levenshtein edit distance").
To add the fuzzy plugin:
from whoosh import qparser

# my_index is an index you have already created or opened
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.FuzzyTermPlugin())
Once you add the fuzzy plugin to the parser, you can specify a fuzzy term by adding a ~ followed by an optional maximum edit distance. If you don’t specify an edit distance, the default is 1.

For example, the following are fuzzy term queries (in the last one, /3 is a prefix length: the first three characters must match exactly):
letter~
letter~2
letter~2/3
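
To make the idea of an edit distance concrete, here is a minimal pure-Python sketch. It is not Whoosh's implementation: the function name osa_distance is made up for illustration, and it uses the restricted "optimal string alignment" variant of Damerau-Levenshtein, which counts insertions, deletions, substitutions and swaps of adjacent characters.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with adjacent transpositions (optimal string alignment).

    Illustrative only; Whoosh computes fuzzy matches differently internally.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # swap of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

With this definition, "letters" is one edit away from "letter" and "mial" is one swap away from "mail", so both would fall inside a maximum edit distance of 1.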




{2} To keep the words in order you would normally use a whoosh.query.Phrase query, but to use fuzzy terms inside a phrase you should replace the default phrase plugin with whoosh.qparser.SequencePlugin:
"letter~ stamp~ mail~"
To replace the default phrase plugin with the sequence plugin:
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.remove_plugin_class(qparser.PhrasePlugin)
parser.add_plugin(qparser.SequencePlugin())




{3} To allow other words in between, set the slop argument of your Phrase query to a larger number:
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)



  slop – the number of words allowed between each "word" in the phrase; the default of 1 means the phrase must match exactly.
You can also set the slop in the query string like this:
"letter~ stamp~ mail~"~10




{4} Overall solution:

{4.a} The indexer would look like:
import os

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT

schema = Schema(title=TEXT(stored=True), content=TEXT)
os.makedirs("indexdir", exist_ok=True)  # create_in() expects the directory to exist
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title=u"First document", content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
writer.add_document(title=u"Fifth document", content=u"letter first,  mail third")
writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
writer.commit()
{4.b} The searcher would look like:
from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin

with ix.searcher() as searcher:
    parser = QueryParser("content", ix.schema)
    parser.add_plugin(FuzzyTermPlugin())
    # Swap the default phrase plugin for the sequence plugin
    # so that fuzzy terms are accepted inside the quotes
    parser.remove_plugin_class(PhrasePlugin)
    parser.add_plugin(SequencePlugin())
    # Each word may differ by up to 2 edits; the slop is 10
    query = parser.parse('"letter~2 stamp~2 mail~2"~10')
    results = searcher.search(query)
    print("nb of results =", len(results))
    for r in results:
        print(r)
That gives the result (shown as printed under Python 2, hence the u prefix on the strings):
nb of results = 2
<Hit {'title': u'Sixth document'}>
<Hit {'title': u'Third document'}>
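
If the keywords come from user input, the query string above can be assembled programmatically. The helper below is only a sketch: build_fuzzy_sequence_query is a hypothetical name, not part of Whoosh, and it assumes the words contain no characters that are special to the query parser (no escaping is done).

```python
def build_fuzzy_sequence_query(words, maxdist=1, slop=1):
    """Build a query string such as "letter~2 stamp~2 mail~2"~10.

    Intended for a parser that has FuzzyTermPlugin and SequencePlugin
    installed; the words are inserted as-is, without escaping.
    """
    fuzzy_terms = " ".join("{}~{}".format(word, maxdist) for word in words)
    return '"{}"~{}'.format(fuzzy_terms, slop)
```

Calling build_fuzzy_sequence_query(["letter", "stamp", "mail"], maxdist=2, slop=10) produces the same string that is passed to parser.parse() in {4.b}.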




{5} If you want fuzzy search to be the default, without writing word~n after each word of the query, you can initialize QueryParser like this:
from whoosh.query import FuzzyTerm
parser = QueryParser("content", ix.schema, termclass=FuzzyTerm)
Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default edit distance of maxdist=1. Subclass it if you want a larger edit distance, and pass the subclass as termclass:
class MyFuzzyTerm(FuzzyTerm):
    # Same as FuzzyTerm, but maxdist defaults to 2
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)
        # In Python 3 this can be shortened to super().__init__(...)






                        