ElasticSearch prevent html tags from being searchable
Problem description
I have text that has been tagged by a different application. I don't want documents to be returned when someone searches for the tag names themselves.
I tried using html_strip, but I was still able to search for these tags.
The tags can vary, but they look like <PERSON>Freddy</PERSON>. I also tried <span>Freddy</span>, and in both cases I could search for either span or PERSON and get the document back, even though these words do not appear anywhere else in the text.
What am I doing wrong?
Index mapping:
{
  "mapping": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip"
        }
      }
    }
  }
}
Query:
{
"query":{
"match":{
"text":"span"
}
},
"highlight":{
"fields":{
"text":{}
}
}
}
Response:
..
"hits": [
  {
    "_index": "my_index",
    "_type": "wat",
    "_id": "1",
    "_score": 0.39556286,
    "_source": {
      "text": "Hello <span>Freddy</span>"
    },
    "highlight": {
      "text": [
        "Hello <<em>span</em>>Freddy</<em>span</em>>"
      ]
    }
  }
]
...
Recommended answer
You have a couple of problems here. First, "mapping" should be "mappings", and you are missing the type when you declare the mappings (so your type wat isn't actually getting that mapping at all). You can use this:
{
  "mappings": {
    "wat": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip"
        }
      }
    }
  }
}
You can use the get mapping API to confirm that the mapping for the type wat looks the way you expect.
Then, if you index Hello <span>Freddy</span> and search for Hello Freddy, you will get this document back. The term that's stored is Hello Freddy, but you will still see the span tags in the search result, because the result returns the source (the value you indexed), not the analyzed terms. (You will also get the same result if you search for Hello <span>Freddy</span>, but that's because the query text is analyzed in the same way as the indexed text.)
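One way to check which terms actually get stored is the analyze API. As a minimal sketch, assuming the index is named my_index (as in the response shown in the question) and was created with the corrected settings, the following body could be sent to POST /my_index/_analyze:

```json
{
  "analyzer": "my_analyzer",
  "text": "Hello <span>Freddy</span>"
}
```

The tokens listed in the response are what the index stores for that field, as opposed to the raw source that comes back in search hits.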
Note that since you've used the keyword tokenizer, you will get no results if you search for Hello or Freddy on their own. If you want to search within the string, instead of matching the full string (or resorting to wildcard, regexp, etc.), you should use a different tokenizer (like the standard tokenizer).
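For illustration, the analysis settings with the standard tokenizer swapped in might look like the sketch below. The lowercase token filter is an addition not mentioned above; it is included on the assumption that case-insensitive matching is wanted, and can be dropped otherwise.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip"
        }
      }
    }
  }
}
```

With a tokenizer that splits on word boundaries, Hello and Freddy should be indexed as separate terms, so a match query for either word on its own can find the document.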
Another warning: the html_strip filter seems to only strip valid HTML tags (so it won't work for <PERSON>). You can probably use a pattern_replace char filter instead.
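A rough sketch of that idea, using the pattern_replace char filter type. The filter name strip_tags_char_filter and the regular expression are assumptions chosen for illustration, not something from the question; the pattern simply replaces anything between angle brackets with a space.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "strip_tags_char_filter"
          ]
        }
      },
      "char_filter": {
        "strip_tags_char_filter": {
          "type": "pattern_replace",
          "pattern": "<[^>]+>",
          "replacement": " "
        }
      }
    }
  }
}
```

This is a crude approach: it would also remove legitimate text that happens to sit between a < and a >, so the pattern may need tightening for real data (for example, to match only the known tag names).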