ElasticSearch: prevent HTML tags from being searchable


Problem description

I have text that is tagged by a different application. I don't want documents to be returned when a query matches only these tags.

I tried using html_strip, but I was still able to search for these tags.

The tags vary, but they look like <PERSON>Freddy</PERSON>. I also tried <span>Freddy</span>. In both cases I could search for either span or PERSON and get a hit, even though these words do not appear anywhere else in the text.

What am I doing wrong?

Index mapping:

{
  "mapping": {
    "properties":{
        "text":{
            "type":"text",
            "analyzer":"my_analyzer"
        }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip"
        }
      }
    }
  }
}

Query:

{
    "query":{
        "match":{
            "text":"span"
        }
    },
    "highlight":{
        "fields":{
            "text":{}
        }
    }
}

Response:

..
"hits": [
            {
                "_index": "my_index",
                "_type": "wat",
                "_id": "1",
                "_score": 0.39556286,
                "_source": {
                    "text": "Hello <span>Freddy</span>"
                },
                "highlight": {
                    "text": [
                        "Hello <<em>span</em>>Freddy</<em>span</em>>"
                    ]
                }
            }
        ]
...


Recommended answer

You have a couple of problems here. First, mapping should be mappings. Second, the type is missing from the mapping declaration, so your type wat isn't actually getting that mapping at all. You can use this:

{
  "mappings": {
    "wat": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip"
        }
      }
    }
  }
}

You can use the get mapping API to confirm that the mapping for the type wat is what you expect.

Then if you index Hello <span>Freddy</span> and search for Hello Freddy, you will get this document back. The term that's stored is Hello Freddy, but you will still see the span tags in the search result, because the result returns the source (the value you indexed), not the analyzed terms. (You will also get the same result if you search for Hello <span>Freddy</span>, but that's because the query text is analyzed in the same way as the indexed text.)
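
To see what the analyzer actually produces, rather than inferring it from search hits, the analyze API can be used. The request body below is a minimal sketch. It assumes it is sent to the _analyze endpoint of the index created above (for example my_index/_analyze, with the index name taken from the response in the question) and that my_analyzer is defined there.

```json
{
  "analyzer": "my_analyzer",
  "text": "Hello <span>Freddy</span>"
}
```

If the settings were applied as intended, the response should contain a single token without the span tags, matching the stored term described above. If span shows up as a token of its own, the field is most likely still being analyzed by the default analyzer.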

Note that since you've used the keyword tokenizer, the whole field value is indexed as a single term, so you will get no results if you search for just Hello or just Freddy. If you want to search within the string, instead of matching the full string (or resorting to wildcard, regexp, etc.), you should use a different tokenizer (like the standard tokenizer).
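
As a rough sketch of that change, the analysis settings could look like the following. Only the tokenizer is swapped relative to the settings above. The lowercase token filter is an addition that is not part of the original answer, included on the assumption that searches should be case-insensitive. The mappings section stays the same and is omitted here.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip"
        }
      }
    }
  }
}
```

With settings along these lines, Hello <span>Freddy</span> should be indexed as separate terms for the two words, so a match query for either word alone can find the document.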

Another caveat: the html_strip filter seems to only strip valid HTML tags (so it won't work for <PERSON>). You can probably use the pattern_replace character filter instead.
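
A possible shape for such a filter is sketched below. The filter name tag_strip_filter and the regular expression are illustrative assumptions, not something given in the answer. The pattern removes anything between a < and the next >, so it would also eat ordinary text such as a < b > c and should be tightened for real data.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": [
            "tag_strip_filter"
          ]
        }
      },
      "char_filter": {
        "tag_strip_filter": {
          "type": "pattern_replace",
          "pattern": "<[^>]+>",
          "replacement": ""
        }
      }
    }
  }
}
```

Under these assumptions, Hello <PERSON>Freddy</PERSON> would be reduced to Hello Freddy before tokenization, regardless of whether the tag name is valid HTML.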
