Django haystack EdgeNgramField gives different results than elasticsearch


Question

I'm currently running haystack with an elasticsearch backend, and I'm building an autocomplete for city names. The problem is that SearchQuerySet gives me different results than the same query executed directly in elasticsearch. From my perspective the haystack results are wrong, and the elasticsearch results are the ones I expect.

I'm using: Django 1.5.4, django-haystack 2.1.0, pyelasticsearch 0.6.1, elasticsearch 0.90.3

Using the following example data:

  • Midfield
  • Midland City
  • Midway
  • Minor
  • Minturn
  • Miami Beach

Using:

SearchQuerySet().models(Geoname).filter(name_auto='mid')
or
SearchQuerySet().models(Geoname).autocomplete(name_auto='mid')

The result always contains all 6 names, including Min* and Mia*. However, querying elasticsearch directly returns the right data:

"query": {
    "filtered" : {
        "query" : {
            "match_all": {}
        },
        "filter" : {
             "term": {"name_auto": "mid"}
        }
    }
}

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1,
      "hits": [
         {
            "_index": "haystack",
            "_type": "modelresult",
            "_id": "csi.geoname.4075977",
            "_score": 1,
            "_source": {
               "name_auto": "Midfield"
            }
         },
         {
            "_index": "haystack",
            "_type": "modelresult",
            "_id": "csi.geoname.4075984",
            "_score": 1,
            "_source": {
               "name_auto": "Midland City"
            }
         },
         {
            "_index": "haystack",
            "_type": "modelresult",
            "_id": "csi.geoname.4075989",
            "_score": 1,
            "_source": {
               "name_auto": "Midway"
            }
         }
      ]
   }
}

The behavior is the same with different examples. My guess is that, through haystack, the query string is being split and analyzed into all possible "min_gram" groups of characters, and that's why it returns the wrong results.
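That guess can be checked with a small simulation. This is a simplified model, not haystack's or elasticsearch's actual code: `edge_ngrams` and `matching_cities` are hypothetical helpers, the analyzer is assumed to lowercase, split on non-letters and emit front edge n-grams with `min_gram` 2 and `max_gram` 15, and a document is assumed to match when it shares at least one term with the query.

```python
import re

CITIES = ["Midfield", "Midland City", "Midway", "Minor", "Minturn", "Miami Beach"]


def edge_ngrams(text, min_gram=2, max_gram=15):
    """Imitate the index analyzer: lowercase, split on non-letters, keep front n-grams."""
    grams = set()
    for token in re.findall(r"[a-z]+", text.lower()):
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            grams.add(token[:size])
    return grams


def matching_cities(query_terms):
    """Return the cities whose indexed n-grams share at least one term with the query."""
    return [city for city in CITIES if edge_ngrams(city) & query_terms]


# Unanalyzed term, as in the direct "term" filter: only the literal "mid" is looked up.
print(matching_cities({"mid"}))

# Query run through the same edge n-gram analyzer: "mid" becomes {"mi", "mid"},
# and "mi" is a prefix of every city name in the sample data.
print(matching_cities(edge_ngrams("mid")))
```

In this model the first call returns only the three Mid* cities, while the second returns all six, which mirrors the difference described above.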

I'm not sure whether I'm doing or understanding something wrong, or whether this is how haystack is supposed to work, but I need the haystack results to match the elasticsearch results.

So, how can I fix the issue or make it work?

My summarized objects look as follows:

Model:

class Geoname(models.Model):
    id = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=255)

Index:

class GeonameIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name_auto = indexes.EdgeNgramField(model_attr='name')

    def get_model(self):
        return Geoname

Mapping:

modelresult: {
    _boost: {
        name: "boost",
        null_value: 1
    },
    properties: {
        django_ct: {
            type: "string"
        },
        django_id: {
            type: "string"
        },
        name_auto: {
            type: "string",
            store: true,
            term_vector: "with_positions_offsets",
            analyzer: "edgengram_analyzer"
        }
    }
}

Thanks.

Recommended answer

After a deep look into the code, I found that the search generated by haystack was:

{
  "query":{
     "filtered":{
        "filter":{
           "fquery":{
              "query":{
                 "query_string":{
                    "query": "django_ct:(csi.geoname)"
                 }
              },
              "_cache":false
           }
        },
        "query":{
           "query_string":{
              "query": "name_auto:(mid)",
              "default_operator":"or",
              "default_field":"text",
              "auto_generate_phrase_queries":true,
              "analyze_wildcard":true
           }
        }
     }
  },
  "from":0,
  "size":6
}

Running this query in elasticsearch gave me the same 6 objects that haystack was showing. But if I added the following to the "query_string":

"analyzer": "standard"

it worked as desired. So the idea was to be able to set up a different search analyzer for the field.
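The experiment can be reproduced by building the request body in Python. This is only a sketch: `build_autocomplete_query` is a hypothetical helper, not part of haystack, and it simply recreates the body shown above with an optional `analyzer` entry in the `query_string`. The resulting JSON could then be sent to the index's `_search` endpoint with any HTTP client.

```python
import json


def build_autocomplete_query(term, analyzer=None, size=6):
    """Build a request body shaped like the one haystack generated."""
    query_string = {
        "query": "name_auto:(%s)" % term,
        "default_operator": "or",
        "default_field": "text",
        "auto_generate_phrase_queries": True,
        "analyze_wildcard": True,
    }
    if analyzer is not None:
        # Overrides the analyzer applied to the query text at search time.
        query_string["analyzer"] = analyzer
    return {
        "query": {
            "filtered": {
                "filter": {
                    "fquery": {
                        "query": {"query_string": {"query": "django_ct:(csi.geoname)"}},
                        "_cache": False,
                    }
                },
                "query": {"query_string": query_string},
            }
        },
        "from": 0,
        "size": size,
    }


# Body with the search-time analyzer forced to "standard".
print(json.dumps(build_autocomplete_query("mid", analyzer="standard"), indent=2))
```

Calling the helper without `analyzer` reproduces the original body, so the two variants can be compared side by side.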

Based on the link in @user954994's answer and the explanation in this post, what I finally did to make it work was:


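  1. I created a custom elasticsearch backend, adding a new custom analyzer based on the standard one.

  2. I added a custom EdgeNgramField that allows setting one analyzer for indexing (index_analyzer) and another for searching (search_analyzer).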

So, my new settings are:

ELASTICSEARCH_INDEX_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_ngram"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"]
                },
                "suggest_analyzer": {
                    "type":"custom",
                    "tokenizer":"standard",
                    "filter":[
                        "standard",
                        "lowercase",
                        "asciifolding"
                    ]
                },
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15
                }
            }
        }
    }
}

My new custom build_schema method looks as follows:

def build_schema(self, fields):
    content_field_name, mapping = super(CustomElasticBackend,
                                          self).build_schema(fields)

    for field_name, field_class in fields.items():
        field_mapping = mapping[field_class.index_fieldname]

        # Optional per-field analyzers set by the custom field class.
        index_analyzer = getattr(field_class, 'index_analyzer', None)
        search_analyzer = getattr(field_class, 'search_analyzer', None)
        field_analyzer = getattr(field_class, 'analyzer', self.DEFAULT_ANALYZER)

        if field_mapping['type'] == 'string' and field_class.indexed:
            if not hasattr(field_class, 'facet_for') and field_class.field_type not in ('ngram', 'edge_ngram'):
                field_mapping['analyzer'] = field_analyzer

        # Separate analyzers replace the single 'analyzer' entry.
        if index_analyzer and search_analyzer:
            field_mapping['index_analyzer'] = index_analyzer
            field_mapping['search_analyzer'] = search_analyzer
            field_mapping.pop('analyzer', None)

        mapping.update({field_class.index_fieldname: field_mapping})
    return (content_field_name, mapping)
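The effect of the override on a single field can be seen in isolation on a plain dict. This is a minimal sketch, assuming the field mapping starts out like the `name_auto` entry from the question; `apply_analyzers` is a hypothetical helper written for illustration and is not part of haystack.

```python
def apply_analyzers(field_mapping, index_analyzer=None, search_analyzer=None):
    """Return a copy of field_mapping with separate index/search analyzers, if both are given."""
    updated = dict(field_mapping)
    if index_analyzer and search_analyzer:
        updated["index_analyzer"] = index_analyzer
        updated["search_analyzer"] = search_analyzer
        # A single "analyzer" entry would apply to both sides, so drop it.
        updated.pop("analyzer", None)
    return updated


before = {
    "type": "string",
    "store": True,
    "term_vector": "with_positions_offsets",
    "analyzer": "edgengram_analyzer",
}
after = apply_analyzers(before, "edgengram_analyzer", "suggest_analyzer")
print(after)
```

When only one of the two analyzers is supplied, the mapping is returned unchanged, which matches the `if index_analyzer and search_analyzer` guard in the method above.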

And after rebuilding the index, my mapping looks as below:

modelresult: {
   _boost: {
       name: "boost",
       null_value: 1
   },
   properties: {
       django_ct: {
           type: "string"
       },
       django_id: {
           type: "string"
       },
       name_auto: {
           type: "string",
           store: true,
           term_vector: "with_positions_offsets",
           index_analyzer: "edgengram_analyzer",
           search_analyzer: "suggest_analyzer"
       }
   }
}

Now everything works as expected!

Update:

Below you'll find the code to clarify this part:



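  1. I created my custom elasticsearch backend, adding a new custom analyzer based on the standard one.

  2. I added a custom EdgeNgramField that allows setting one analyzer for indexing (index_analyzer) and another for searching (search_analyzer).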


In my app's search_backends.py:

from django.conf import settings
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend
from haystack.backends.elasticsearch_backend import ElasticsearchSearchEngine
from haystack.fields import EdgeNgramField as BaseEdgeNgramField


# Custom Backend 
class CustomElasticBackend(ElasticsearchSearchBackend):

    DEFAULT_ANALYZER = None

    def __init__(self, connection_alias, **connection_options):
        super(CustomElasticBackend, self).__init__(
                                connection_alias, **connection_options)
        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
        self.DEFAULT_ANALYZER = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', "snowball")
        if user_settings:
            setattr(self, 'DEFAULT_SETTINGS', user_settings)

    def build_schema(self, fields):
        content_field_name, mapping = super(CustomElasticBackend,
                                              self).build_schema(fields)

        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]

            index_analyzer = getattr(field_class, 'index_analyzer', None)
            search_analyzer = getattr(field_class, 'search_analyzer', None)
            field_analyzer = getattr(field_class, 'analyzer', self.DEFAULT_ANALYZER)

            if field_mapping['type'] == 'string' and field_class.indexed:
                if not hasattr(field_class, 'facet_for') and field_class.field_type not in ('ngram', 'edge_ngram'):
                    field_mapping['analyzer'] = field_analyzer

            # Separate analyzers replace the single 'analyzer' entry.
            if index_analyzer and search_analyzer:
                field_mapping['index_analyzer'] = index_analyzer
                field_mapping['search_analyzer'] = search_analyzer
                field_mapping.pop('analyzer', None)

            mapping.update({field_class.index_fieldname: field_mapping})
        return (content_field_name, mapping)


class CustomElasticSearchEngine(ElasticsearchSearchEngine):
    backend = CustomElasticBackend


# Custom field
class CustomFieldMixin(object):

    def __init__(self, **kwargs):
        self.analyzer = kwargs.pop('analyzer', None)
        self.index_analyzer = kwargs.pop('index_analyzer', None)
        self.search_analyzer = kwargs.pop('search_analyzer', None)
        super(CustomFieldMixin, self).__init__(**kwargs)


class CustomEdgeNgramField(CustomFieldMixin, BaseEdgeNgramField):
    pass
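The mixin relies on popping its own keyword arguments before the base field's constructor sees them. The pattern can be tried without haystack installed; in this sketch `BaseField`, `AnalyzerMixin` and `DemoField` are hypothetical stand-ins for the haystack field, `CustomFieldMixin` and `CustomEdgeNgramField`.

```python
class BaseField(object):
    """Stand-in for a haystack field; it accepts only the keyword arguments it knows."""

    def __init__(self, model_attr=None):
        self.model_attr = model_attr


class AnalyzerMixin(object):
    """Same pattern as CustomFieldMixin: pop the extra options, pass the rest on."""

    def __init__(self, **kwargs):
        self.analyzer = kwargs.pop("analyzer", None)
        self.index_analyzer = kwargs.pop("index_analyzer", None)
        self.search_analyzer = kwargs.pop("search_analyzer", None)
        super(AnalyzerMixin, self).__init__(**kwargs)


class DemoField(AnalyzerMixin, BaseField):
    pass


field = DemoField(
    model_attr="name",
    index_analyzer="edgengram_analyzer",
    search_analyzer="suggest_analyzer",
)
print(field.model_attr, field.index_analyzer, field.search_analyzer)
```

The mixin has to come first in the list of base classes. Otherwise the base constructor would receive `index_analyzer` and `search_analyzer` and reject them as unexpected keyword arguments.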

My index is defined as follows:

class MyIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name_auto = CustomEdgeNgramField(model_attr='name', index_analyzer="edgengram_analyzer", search_analyzer="suggest_analyzer")

And finally, the settings of course use the custom backend in the haystack connection definition:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'my_app.search_backends.CustomElasticSearchEngine',
        'URL': 'http://localhost:9200',
        'INDEX_NAME': 'index'
    },
}
