django-haystack autocomplete returns too wide results

Problem description

I have created an index with a title_auto field:

from haystack import indexes

class GameIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, model_attr='title')
    title = indexes.CharField(model_attr='title')
    # ngram copy of the title, used for autocomplete
    title_auto = indexes.NgramField(model_attr='title')

The Elasticsearch settings look like this:

ELASTICSEARCH_INDEX_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_ngram"],
                    "token_chars": ["letter", "digit"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"]
                }
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 15
                }
            }
        }
    }
}

I tried an autocomplete search. It works, but it returns too many irrelevant results:

qs = SearchQuerySet().models(Game).autocomplete(title_auto=search_phrase)

qs = SearchQuerySet().models(Game).filter(title_auto=search_phrase)

Both of them produce the same output.

If search_phrase is "monopoly", the first results contain "Monopoly" in their titles. However, although there are only 2 relevant items, it returns 51. The others have nothing to do with "Monopoly" at all.

So my question is: how can I change the relevance of the results?

Recommended answer

It's hard to tell for sure since I haven't seen your full mapping, but I suspect the problem is that the same analyzer (one of the ngram ones) is being used for both indexing and searching. When you index a document, lots of ngram terms get created and indexed. If your search text is analyzed the same way, lots of search terms get generated too. Since your smallest ngram is a single letter, pretty much any query is going to match a lot of documents.
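
To get a feel for how many terms this produces, here is a minimal Python sketch of what an ngram filter with min_gram 1 and max_gram 15 would emit for a single lowercase token. The ngrams helper is a hypothetical stand-in written for illustration; it is not Elasticsearch or haystack code, and the real filter may emit terms in a different order.

```python
def ngrams(token, min_gram=1, max_gram=15):
    """Return every substring of token whose length is between min_gram and max_gram."""
    terms = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            terms.append(token[start:start + size])
    return terms


index_terms = ngrams("monopoly")   # terms stored for one 8-letter title
query_terms = ngrams("poly")       # terms generated if the query is analyzed the same way

print(len(index_terms))            # how many terms a single short title expands into
print(sorted(set(query_terms)))    # includes the single letters "l", "o", "p", "y"
```

Any document whose title contains one of those single letters shares a term with the query, which is why the result set grows so quickly.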

We wrote a blog post about using ngrams for autocomplete that you might find helpful, here: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. But I'll give you a simpler example to illustrate what I mean. I'm not very familiar with haystack, so I probably can't help you there, but I can explain the issue with ngrams in Elasticsearch.

First I'll set up an index that uses an ngram analyzer for both indexing and searching:

PUT /test_index
{
   "settings": {
       "number_of_shards": 1,
      "analysis": {
         "filter": {
            "nGram_filter": {
               "type": "nGram",
               "min_gram": 1,
               "max_gram": 15,
               "token_chars": [
                  "letter",
                  "digit",
                  "punctuation",
                  "symbol"
               ]
            }
         },
         "analyzer": {
            "nGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "nGram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
        "doc": {
            "properties": {
                "title": {
                    "type": "string", 
                    "analyzer": "nGram_analyzer"
                }
            }
        }
   }
}

and add some docs:

PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"title":"monopoly"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"title":"oligopoly"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"title":"plutocracy"}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"title":"theocracy"}
{"index":{"_index":"test_index","_type":"doc","_id":5}}
{"title":"democracy"}

and run a simple match search for "poly":

POST /test_index/_search
{
    "query": {
        "match": {
           "title": "poly"
        }
    }
}

it returns all five documents:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 4.729521,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 4.729521,
            "_source": {
               "title": "oligopoly"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 4.3608603,
            "_source": {
               "title": "monopoly"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 1.0197333,
            "_source": {
               "title": "plutocracy"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "4",
            "_score": 0.31496215,
            "_source": {
               "title": "theocracy"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "5",
            "_score": 0.31496215,
            "_source": {
               "title": "democracy"
            }
         }
      ]
   }
}

This is because the search term "poly" gets tokenized into the terms "p", "o", "l", and "y" (along with the longer ngrams). Since the "title" field in each of the documents was also tokenized down to single-letter terms, the query matches every document.

If we rebuild the index with this mapping instead (same analyzer and docs):

"mappings": {
  "doc": {
     "properties": {
        "title": {
           "type": "string",
           "index_analyzer": "nGram_analyzer",
           "search_analyzer": "standard"
        }
     }
  }
}
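
One caveat: index_analyzer was removed in Elasticsearch 2.0, where the index-time analyzer is set with analyzer and only the search side keeps its own search_analyzer key. Assuming such a version, the same mapping could be built roughly like this. The build_title_mapping helper is hypothetical and only returns a plain dict; later versions also rename the "string" type to "text", so treat the field type as an assumption to check against your cluster.

```python
import json


def build_title_mapping(index_analyzer="nGram_analyzer", search_analyzer="standard"):
    """Build the "doc" mapping with separate index-time and search-time analyzers.

    Assumes Elasticsearch 2.0 or later, where "analyzer" replaces "index_analyzer".
    """
    return {
        "doc": {
            "properties": {
                "title": {
                    "type": "string",
                    "analyzer": index_analyzer,          # applied when documents are indexed
                    "search_analyzer": search_analyzer,  # applied to the query text
                }
            }
        }
    }


# Body fragment that would go under "mappings" when creating the index
print(json.dumps({"mappings": build_title_mapping()}, indent=2))
```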

With the standard analyzer applied at search time, the query returns what we expect:

POST /test_index/_search
{
    "query": {
        "match": {
           "title": "poly"
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 1.5108256,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 1.5108256,
            "_source": {
               "title": "monopoly"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 1.5108256,
            "_source": {
               "title": "oligopoly"
            }
         }
      ]
   }
}
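
The difference between the two setups can be reproduced without a cluster. The following is a rough Python sketch, assuming a match query with its default OR operator and ignoring scoring entirely; ngrams and search are hypothetical helpers, not Elasticsearch APIs.

```python
TITLES = ["monopoly", "oligopoly", "plutocracy", "theocracy", "democracy"]


def ngrams(token, min_gram=1, max_gram=15):
    """Return the set of substrings of token with length min_gram..max_gram."""
    return {
        token[start:start + size]
        for size in range(min_gram, min(max_gram, len(token)) + 1)
        for start in range(len(token) - size + 1)
    }


def search(query_terms):
    """Return the titles that share at least one indexed term with the query."""
    return [title for title in TITLES if ngrams(title) & query_terms]


# ngram analyzer at search time: "poly" becomes "p", "o", "l", "y", "po", ...
print(search(ngrams("poly")))   # all five titles

# standard analyzer at search time: "poly" stays a single term
print(search({"poly"}))         # ['monopoly', 'oligopoly']
```

The indexed terms are identical in both runs; only the analysis of the query text changes.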

Edge ngrams work similarly, except that only terms that start at the beginning of the words will be used.
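
As a small illustration of that difference, here is a sketch of front-anchored edge ngrams in Python. The edge_ngrams helper is hypothetical and assumes a single lowercase token with min_gram 1 and max_gram 15.

```python
def edge_ngrams(token, min_gram=1, max_gram=15):
    """Return the prefixes of token with length min_gram..max_gram."""
    return [token[:size] for size in range(min_gram, min(max_gram, len(token)) + 1)]


index_terms = edge_ngrams("monopoly")
print(index_terms)              # ['m', 'mo', 'mon', 'mono', 'monop', 'monopo', 'monopol', 'monopoly']
print("mono" in index_terms)    # True: a prefix of the title matches
print("poly" in index_terms)    # False: "poly" is not a prefix of "monopoly"
```

So with edge ngrams, a query for "poly" would no longer find "monopoly", while "mono" still would, which is usually what autocomplete needs.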

Here is the code I used for this example:

http://sense.qbox.io/gist/b24cbc531b483650c085a42963a49d6a23fa5579
