Elasticsearch mapping with numeric token

Question

I have the mapping below, and it works as expected:

{
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "0",
      "analysis": {
        "filter": {
          "stemmer_plural_portugues": {
            "name": "minimal_portuguese",
            "stopwords" : ["http", "https", "ftp", "www"],
            "type": "stemmer"
          },
          
          
          "synonym_filter": {
            "type": "synonym",
            "lenient": true,
            "synonyms_path": "analysis/synonym.txt",
            "updateable": true
          },
          "shingle_filter": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3
          }

        },
        
        "analyzer": {
          "analyzer_customizado": {
            "filter": [
              "lowercase",
              "stemmer_plural_portugues",
              "asciifolding",
              "synonym_filter",
              "shingle_filter"
              
            ],
            "tokenizer": "lowercase"
          }
        }

      }
    }
  },
  "mappings": {
      "properties": {

        "id": {
         "type": "long"
        },
         "data": {
          "type": "date"
        },
         "quebrado": {
          "type": "byte"
          
        },
         "pgrk": {
           "type":  "integer" 
        },
         "url_length": {
           "type":  "integer" 
        },
        "title": {
          "analyzer": "analyzer_customizado",
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "description": {
        "analyzer": "analyzer_customizado",
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "url": {
          "analyzer": "analyzer_customizado",
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        }
      }
    }
  }

I index the document below:

{
    "title": "rocket 1960",
    "description": "space",
    "url": "www.nasa.com"
}

If I execute the query below with the AND operator, it finds the document as expected, because all the searched words exist in the document.

{
    "from": 0,
    "size": 10,
    
    "query": {
      
            
                "multi_match": {
                    "query": "space nasa rocket",
                    "type": "cross_fields",
                    "fields": [
                        "title",
                        "description",
                        "url"
                    ],
                    "operator": "and"
              }

    }
}
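
To make the AND behavior concrete, here is a minimal Python sketch of term-centric matching, where each query term only has to appear in at least one of the fields. It is a simplified model, not Elasticsearch's implementation: the token lists are written by hand and ignore the stemmer, synonym, and shingle filters, and `cross_fields_and` is a hypothetical helper name.

```python
# Hand-written tokens for the sample document, assuming a letter-only
# tokenizer (stemming, synonyms, and shingles are ignored in this sketch).
doc_tokens = {
    "title": ["rocket"],
    "description": ["space"],
    "url": ["www", "nasa", "com"],
}

def cross_fields_and(query_terms, fields):
    """Term-centric AND: every term must occur in at least one field."""
    return all(
        any(term in tokens for tokens in fields.values())
        for term in query_terms
    )

# Each of the three terms lives in a different field, yet the AND query matches.
print(cross_fields_and(["space", "nasa", "rocket"], doc_tokens))
```

A term that appears in none of the fields makes the whole check fail, which is what the AND operator is expected to enforce.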

But if I also include "1960" in the search, as in the query below, nothing is returned:

{
        "from": 0,
        "size": 10,
        
        "query": {
          
                
                    "multi_match": {
                        "query": "1960 space nasa rocket",
                        "type": "cross_fields",
                        "fields": [
                            "title",
                            "description",
                            "url"
                        ],
                        "operator": "and"
                  }
    
        }
    }

I found that my "lowercase" tokenizer does not generate numeric tokens. So I changed the tokenizer to "standard", and the numeric token 1960 is now generated.

However, the query still finds nothing, because the url field containing the link www.nasa.com is no longer split into the tokens "www", "nasa", "com"; the generated token is the entire link, www.nasa.com.
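
The difference between the two tokenizers can be approximated in a few lines of Python. This is only a rough sketch: the regular expressions below are stand-ins chosen for illustration, not the real tokenizer rules (the "standard" tokenizer follows Unicode text segmentation, which is far more detailed), and both function names are hypothetical.

```python
import re

def lowercase_tokenize(text):
    """Approximates the `lowercase` tokenizer: split on every non-letter
    character and lowercase the pieces, so digits never form a token."""
    return re.findall(r"[^\W\d_]+", text.lower())

def standard_tokenize_approx(text):
    """Rough stand-in for the `standard` tokenizer: keeps digits and does
    not split on a dot that sits between two word characters."""
    return re.findall(r"\w+(?:\.\w+)*", text)

print(lowercase_tokenize("rocket 1960"))         # the number is dropped
print(lowercase_tokenize("www.nasa.com"))        # split on the dots
print(standard_tokenize_approx("rocket 1960"))   # the number is kept
print(standard_tokenize_approx("www.nasa.com"))  # the link stays whole
```

Against a real cluster, the `_analyze` API is the reliable way to check which tokens an analyzer or tokenizer actually produces.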

The query only works if I enter the full URL www.nasa.com, as shown below:

{
            "from": 0,
            "size": 10,
            
            "query": {
              
                    
                        "multi_match": {
                            "query": "1960 space www.nasa.com rocket",
                            "type": "cross_fields",
                            "fields": [
                                "title",
                                "description",
                                "url"
                            ],
                            "operator": "and"
                      }
        
            }
        }

If I create another "lowercase" tokenizer just for the url field, the link www.nasa.com is again split into the separate tokens "www", "nasa", "com".

But the query below still finds nothing, because the url field now has a different tokenizer from the title and description fields. The query only works if I use the OR operator, but I need the AND operator:

{
                "from": 0,
                "size": 10,
                
                "query": {
                  
                        
                            "multi_match": {
                                "query": "1960 space nasa rocket",
                                "type": "cross_fields",
                                "fields": [
                                    "title",
                                    "description",
                                    "url"
                                ],
                                "operator": "and"
                          }
            
                }
            }
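
One way to picture why mixed analyzers break the AND query: as far as the documented behavior of cross_fields goes, fields are grouped by analyzer and the operator is applied within each group rather than across all fields. The Python sketch below models that under simplifying assumptions. The token lists are hand-written, the group layout is hypothetical, and this is not how Elasticsearch is implemented internally.

```python
# Hypothetical analyzer groups for the sample document. The query string
# "1960 space nasa rocket" is analyzed separately for each group.
groups = [
    {   # title + description, analyzed with a `standard`-style tokenizer
        "query_terms": ["1960", "space", "nasa", "rocket"],
        "fields": {"title": ["rocket", "1960"], "description": ["space"]},
    },
    {   # url, analyzed with the `lowercase` tokenizer, which drops "1960"
        "query_terms": ["space", "nasa", "rocket"],
        "fields": {"url": ["www", "nasa", "com"]},
    },
]

def group_matches(group, operator):
    """Apply the operator to the terms inside a single analyzer group."""
    present = set().union(*group["fields"].values())
    hits = [term in present for term in group["query_terms"]]
    return all(hits) if operator == "and" else any(hits)

def cross_fields_matches(groups, operator):
    """The document matches if any analyzer group satisfies the operator."""
    return any(group_matches(group, operator) for group in groups)

print(cross_fields_matches(groups, "and"))  # no single group holds every term
print(cross_fields_matches(groups, "or"))   # some group holds at least one term
```

In this model, "nasa" is missing from the first group and "space" and "rocket" are missing from the second, so AND fails even though every term exists somewhere in the document.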

I cannot use Ngram in my mapping because I use the Phrase Suggester, and with Ngram the suggestions are generated from hundreds of tokens, which makes them inaccurate.

Does anyone know a solution for my mapping that generates numeric tokens in my title and description fields, while the url field keeps breaking website links into several tokens ("www", "nasa", "com") instead of keeping the whole link (www.nasa.com), and that lets my query work with the AND operator across all fields at the same time?

Recommended answer


You wrote: "if I put it in the search also "1960" as the query below does not return anything".

In the following index mapping, I have removed synonym_filter. After removing it, indexing the sample document, and running the same search query you mentioned in the question, I get the desired result.

Index mapping:

 {
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "0",
      "analysis": {
        "filter": {
          "stemmer_plural_portugues": {
            "name": "minimal_portuguese",
            "stopwords": [
              "http",
              "https",
              "ftp",
              "www"
            ],
            "type": "stemmer"
          },
          "shingle_filter": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3
          }
        },
        "analyzer": {
          "analyzer_customizado": {
            "filter": [
              "lowercase",
              "stemmer_plural_portugues",
              "asciifolding",
              "shingle_filter"
            ],
            "tokenizer": "lowercase"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "long"
      },
      "data": {
        "type": "date"
      },
      "quebrado": {
        "type": "byte"
      },
      "pgrk": {
        "type": "integer"
      },
      "url_length": {
        "type": "integer"
      },
      "title": {
        "analyzer": "analyzer_customizado",
        "type": "text",
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        }
      },
      "description": {
        "analyzer": "analyzer_customizado",
        "type": "text",
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        }
      },
      "url": {
        "analyzer": "analyzer_customizado",
        "type": "text",
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        }
      }
    }
  }
}

Search query:

{
  "from": 0,
  "size": 10,
  "query": {
    "multi_match": {
      "query": "1960 space nasa rocket",
      "type": "cross_fields",
      "fields": [
        "title",
        "description",
        "url"
      ],
      "operator": "and"
    }
  }
}

Search result:

"hits": [
        {
            "_index": "my-index",
            "_type": "_doc",
            "_id": "1",
            "_score": 0.9370217,
            "_source": {
                "title": "rocket 1960",
                "description": "space",
                "url": "www.nasa.com"
            }
        }
    ]

As stated by @Gibbs, I think there is some issue in synonym_filter, so it would help if you shared synonym.txt; otherwise, the search query runs perfectly.

Update 1 (including synonym_filter):

If you want to include the synonym token filter, keep the index mapping the same as yours and make just one change to the filter definition:

"synonym_filter": {
  "type": "synonym",
  "lenient": true,
  "synonyms_path": "analysis/synonym.txt",
  "updateable": false  --> set this to false
},



You set your synonym filter to "updateable", presumably because you want to change synonyms without having to close and reopen the index, using the reload API instead. Updateable synonyms restrict the analyzer they are used in to search time only.
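
Given that restriction, one possible arrangement is to keep the updateable filter out of the index-time analyzer and reference it only from a separate search analyzer. The sketch below builds such settings as Python dictionaries. It is an assumption layered on top of the answer, not something the answer itself proposes: the name `analyzer_customizado_busca` is hypothetical, and the stemmer and shingle filter definitions from the question are omitted for brevity.

```python
import json

# Filters shared by both analyzers (their definitions are the ones from
# the question and are left out of this sketch).
base_filters = ["lowercase", "stemmer_plural_portugues", "asciifolding"]

analysis = {
    "filter": {
        "synonym_filter": {
            "type": "synonym",
            "lenient": True,
            "synonyms_path": "analysis/synonym.txt",
            "updateable": True,  # only usable in search-time analyzers
        },
    },
    "analyzer": {
        # Index-time analyzer: no updateable filter in the chain.
        "analyzer_customizado": {
            "tokenizer": "lowercase",
            "filter": base_filters + ["shingle_filter"],
        },
        # Hypothetical search-time analyzer that adds the synonyms.
        "analyzer_customizado_busca": {
            "tokenizer": "lowercase",
            "filter": base_filters + ["synonym_filter", "shingle_filter"],
        },
    },
}

# A text field then points at both analyzers.
title_mapping = {
    "type": "text",
    "analyzer": "analyzer_customizado",
    "search_analyzer": "analyzer_customizado_busca",
}

print(json.dumps(title_mapping, indent=2))
```

Whether synonyms applied only at search time give acceptable results depends on the contents of synonym.txt, which the question does not show.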

For the full explanation, you can refer to this ES discussion.

Using the same search query as above (after changing the mapping), you will get the desired result.

But if you still want to set "updateable": true, you can refer to the official documentation of the Reload search analyzers API.
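
As a small illustration of what calling that API involves, the Python sketch below builds the HTTP request without sending it. The base URL `http://localhost:9200` is an assumption about a local cluster, `my-index` is the index name shown in the search result above, and `build_reload_request` is a hypothetical helper.

```python
import urllib.request

def build_reload_request(base_url, index):
    """Build (but do not send) a POST request for the reload search
    analyzers endpoint of the given index."""
    url = f"{base_url.rstrip('/')}/{index}/_reload_search_analyzers"
    return urllib.request.Request(url, method="POST")

request = build_reload_request("http://localhost:9200", "my-index")
print(request.get_method(), request.full_url)

# To send it against a running cluster:
# urllib.request.urlopen(request)
```

The reload would typically be issued after synonym.txt has been updated on the nodes, so that search analyzers pick up the new synonyms.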
