Why is an HTML tag searchable even if it was filtered in Elasticsearch?

Problem description

I am new to Elasticsearch and was testing the html_strip char filter. Ideally, I should not be able to search on HTML tags. These are the steps I followed.

Index:

curl -XPOST 'localhost:9200/foo/test/_analyzer?tokenizer=standard&char_filters=html_strip' -d '
{
    "content" : "<title>Dilip Kumar</title>"
}'

Search:

http://localhost:9200/foo/test/_search?tokenizer=standard&char_filters=html_strip&q=title

Result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2169777,
    "hits": [
      {
        "_index": "foo",
        "_type": "test",
        "_id": "_analyzer",
        "_score": 0.2169777,
        "_source": {
          "content": "<title>Dilip Kumar</title>"
        }
      }
    ]
  }
}

UPDATE: As suggested, I deleted the existing index, created it again with the following mapping, and repeated the steps above. However, I am still able to search on the markup.

curl -XPUT "http://localhost:9200/foo " -d'
{
  "foo": {
    "settings": {
      "analysis": {
        "analyzer": {
          "html_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "stop",
              "asciifolding"
            ],
            "char_filter": [
              "html_strip"
            ]
          },
          "whitespace_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "standard",
              "lowercase",
              "stop",
              "asciifolding"
            ]
          }
        }
      }
    },
    "mappings": {
      "test": {
        "properties": {
          "content": {
            "type": "string",
            "index_analyzer": "html_analyzer",
            "search_analyzer": "whitespace_analyzer"
          }
        }
      }
    }
  }
}'

Solution

You need to apply the analyzer through the mapping before indexing. This ensures that every indexed document passes through the analyzer, so the tags are stripped before the terms are written to the index. In your case, you applied the analyzer at query time, which affects only your search phrase and not the data you are searching.

You can read more on creating mapping here
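To make the index-time versus query-time distinction above concrete, here is a minimal Python sketch. It is a toy model, not Elasticsearch code: `strip_html` and `analyze` are hypothetical helpers, and the regular expressions only roughly approximate the `html_strip` char filter and the `standard` tokenizer followed by `lowercase`.

```python
import re

def strip_html(text):
    """Rough stand-in for the html_strip char filter: replace tags with spaces."""
    return re.sub(r"<[^>]*>", " ", text)

def analyze(text, strip):
    """Rough stand-in for a standard tokenizer followed by lowercasing."""
    if strip:
        text = strip_html(text)
    return [tok.lower() for tok in re.findall(r"\w+", text)]

doc = "<title>Dilip Kumar</title>"

# Indexed without html_strip (the situation in the question):
# the tag name ends up in the index as an ordinary term.
indexed_raw = analyze(doc, strip=False)

# Indexed with html_strip applied through the mapping:
# only the text content is turned into terms.
indexed_stripped = analyze(doc, strip=True)

print(indexed_raw)
print(indexed_stripped)
print("title" in indexed_raw, "title" in indexed_stripped)
```

In this model, stripping tags from the query alone changes nothing, because the term `title` is already stored in the index built from the raw document.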

I don't believe there is a request format like this:

http://localhost:9200/foo/test/_search?tokenizer=standard&char_filters=html_strip&q=title

Instead, if you set the analyzer as follows, it should work fine:

curl -XPUT "http://localhost:9200/foo" -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "asciifolding"
          ],
          "char_filter": [
            "html_strip"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "html_analyzer"
        }
      }
    }
  }
}'

Here I made the analyzer common to both indexing and searching.
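As a rough illustration of what a shared analyzer buys you, the sketch below builds a tiny in-memory inverted index in Python. It is a toy model under stated assumptions, not Elasticsearch itself: `html_analyzer`, `index_doc`, and `search` are hypothetical helpers, and the regex only approximates `html_strip`.

```python
import re
from collections import defaultdict

def html_analyzer(text):
    """Toy analyzer: strip tags, split on non-word characters, lowercase."""
    without_tags = re.sub(r"<[^>]*>", " ", text)
    return [tok.lower() for tok in re.findall(r"\w+", without_tags)]

index = defaultdict(set)  # term -> ids of the documents containing it

def index_doc(doc_id, content):
    # Index-time analysis: only the analyzer's output terms are stored.
    for term in html_analyzer(content):
        index[term].add(doc_id)

def search(query):
    # Search-time analysis reuses the same analyzer as indexing.
    hits = set()
    for term in html_analyzer(query):
        hits |= index.get(term, set())
    return hits

index_doc(1, "<title>Dilip Kumar</title>")
print(search("title"))  # the tag name was never indexed
print(search("Dilip"))  # the text content still matches
```

Because the same function runs on both sides, query terms are normalized exactly as the indexed terms were, so a search for `Dilip` matches the stored term `dilip` while `title` finds nothing.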
