How to allow searching with special characters in Elasticsearch using the Attachment plugin?


Problem description

I'm working on a Spring Boot / JHipster based project.

I'm using Elasticsearch 6.8.6 with the Attachment plugin. In that setup, the content field holds the extracted text of my document.

Now, when I search '192.168.31.167' it gives an appropriate result. But when I search "192.168.31.167:9200" it gives an empty result.

In short, it's not working with special characters. Can someone guide me on how to deal with this?

Mapping:

{
  "document" : {
    "mappings" : {
      "doc" : {
        "properties" : {
          "attachment" : {
            "properties" : {
              "content" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "content_length" : {
                "type" : "long"
              },
              "content_type" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          },
          "content" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "createdDate" : {
            "type" : "date"
          },
          "holder" : {
            "type" : "long"
          },
          "id" : {
            "type" : "long"
          },
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "tag" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

Sample data:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "document",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "createdDate" : "2020-05-19T03:56:36+0000",
          "attachment" : {
            "content_type" : "text/plain; charset=ISO-8859-1",
            "content" : "version: '2'\nservices:\n  docy-kibana:\n    image: docker.elastic.co/kibana/kibana:6.8.6\n    ports:\n      - 5601:5601\n\n    environment:\n      SERVER_NAME: kibana.example.org\n      ELASTICSEARCH_HOSTS: http://192.168.31.167:9200/\n      XPACK_MONITORING_ENABLED: ${true}\n#      XPACK_ENCRYPTEDSAVEDOBJECTS.ENCRYPTIONKEY: test\n      XPACK_MONITORING_UI_CONTAINER_ELASTICSEARCH_ENABLED: ${true}",
            "content_length" : 390
          },
          "name" : "kibana_3_202005190926.yml",
          "holder" : 3,
          "id" : 1,
          "tag" : "configuration",
          "content" : "dmVyc2lvbjogJzInCnNlcnZpY2VzOgogIGRvY3kta2liYW5hOgogICAgaW1hZ2U6IGRvY2tlci5lbGFzdGljLmNvL2tpYmFuYS9raWJhbmE6Ni44LjYKICAgIHBvcnRzOgogICAgICAtIDU2MDE6NTYwMQoKICAgIGVudmlyb25tZW50OgogICAgICBTRVJWRVJfTkFNRToga2liYW5hLmV4YW1wbGUub3JnCiAgICAgIEVMQVNUSUNTRUFSQ0hfSE9TVFM6IGh0dHA6Ly8xOTIuMTY4LjMxLjE2Nzo5MjAwLwogICAgICBYUEFDS19NT05JVE9SSU5HX0VOQUJMRUQ6ICR7dHJ1ZX0KIyAgICAgIFhQQUNLX0VOQ1JZUFRFRFNBVkVET0JKRUNUUy5FTkNSWVBUSU9OS0VZOiB0ZXN0CiAgICAgIFhQQUNLX01PTklUT1JJTkdfVUlfQ09OVEFJTkVSX0VMQVNUSUNTRUFSQ0hfRU5BQkxFRDogJHt0cnVlfQo="
        }
      }
    ]
  }
}

Elasticsearch request generated by the code:

{
  "bool" : {
    "must" : [
      {
        "bool" : {
          "should" : [
            {
              "query_string" : {
                "query" : "*192.168.31.167:9200*",
                "fields" : [
                  "content^1.0",
                  "name^2.0",
                  "tag^3.0"
                ],
                "type" : "best_fields",
                "default_operator" : "or",
                "max_determinized_states" : 10000,
                "enable_position_increments" : true,
                "fuzziness" : "AUTO",
                "fuzzy_prefix_length" : 0,
                "fuzzy_max_expansions" : 50,
                "phrase_slop" : 0,
                "analyze_wildcard" : true,
                "escape" : false,
                "auto_generate_synonyms_phrase_query" : true,
                "fuzzy_transpositions" : true,
                "boost" : 1.0
              }
            },
            {
              "wildcard" : {
                "attachment.content" : {
                  "wildcard" : "*192.168.31.167:9200*",
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      },
      {
        "bool" : {
          "should" : [
            {
              "wildcard" : {
                "tag.keyword" : {
                  "wildcard" : "*information*",
                  "boost" : 1.0
                }
              }
            },
            {
              "wildcard" : {
                "tag.keyword" : {
                  "wildcard" : "*user*",
                  "boost" : 1.0
                }
              }
            }
          ],
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      }
    ],
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}

Answer

Problem:

You are querying text fields, which use the standard analyzer and therefore split the text on ':'. Both the query_string and the wildcard clauses in your request are matched against those individual tokens, so no single token can ever match *192.168.31.167:9200*. The analyze API call below shows the split:

POST /_analyze
{
    "text" : "127.0.0.1:9200",
    "analyzer" : "standard"
}

Generated tokens:

{
    "tokens": [
        {
            "token": "127.0.0.1",
            "start_offset": 0,
            "end_offset": 9,
            "type": "<NUM>",
            "position": 0
        },
        {
            "token": "9200",
            "start_offset": 10,
            "end_offset": 14,
            "type": "<NUM>",
            "position": 1
        }
    ]
}

Solution 1

Not optimized (wildcard queries on a bigger index can cause severe performance issues), but since you are already using wildcards it will work without changing the analyzer and reindexing the whole data (less overhead):

Use the .keyword sub-field that is available on these text fields; it does not split the text into two tokens, as shown below.
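For illustration, running the same text through the keyword analyzer (which, like a keyword field, emits the entire input as a single token) shows the difference:

POST /_analyze
{
    "text" : "127.0.0.1:9200",
    "analyzer" : "keyword"
}

which returns: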

{
    "tokens": [
        {
            "token": "127.0.0.1:9200",
            "start_offset": 0,
            "end_offset": 14,
            "type": "word",
            "position": 0
        }
    ]
}

You can add .keyword to the queried fields as shown below:

             "content.keyword^1.0",
              "name.keyword^2.0",
              "tag.keyword^3.0"

Solution 2

Refer to the solution mentioned in the comment by @val, which involves creating a custom analyzer and reindexing the whole data. This creates the expected tokens in the index, so they can then be searched without the expensive regex. It will perform significantly better on large datasets, at the one-time overhead of reindexing the whole data with the new analyzer and adapting the queries. A sketch follows below.
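
As a minimal sketch (the index name document_v2, the analyzer name no_split_analyzer, and the choice of the whitespace tokenizer are assumptions for illustration; pick whatever tokenizer fits your data), the new index could be created and filled like this:

# Assumed example: the whitespace tokenizer keeps "192.168.31.167:9200"
# as one token because it only splits on whitespace
PUT /document_v2
{
    "settings": {
        "analysis": {
            "analyzer": {
                "no_split_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [ "lowercase" ]
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "attachment": {
                    "properties": {
                        "content": {
                            "type": "text",
                            "analyzer": "no_split_analyzer"
                        }
                    }
                }
            }
        }
    }
}

# Copy the existing documents into the new index
POST /_reindex
{
    "source": { "index": "document" },
    "dest": { "index": "document_v2" }
}

After reindexing, a plain match query for 192.168.31.167:9200 hits the single token directly, with no wildcards needed.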

Choose whichever approach better fits your business requirements.
