Why is an HTML tag searchable even if it was filtered in Elasticsearch?
I am new to Elasticsearch and was testing the html_strip char filter. Ideally, I should not be able to search on HTML tags. These are the steps I followed.
Index:
curl -XPOST 'localhost:9200/foo/test/_analyzer?tokenizer=standard&char_filters=html_strip' -d '
{
"content" : "<title>Dilip Kumar</title>"
}'
Search:
http://localhost:9200/foo/test/_search?tokenizer=standard&char_filters=html_strip&q=title
Result:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2169777,
    "hits": [
      {
        "_index": "foo",
        "_type": "test",
        "_id": "_analyzer",
        "_score": 0.2169777,
        "_source": {
          "content": "<title>Dilip Kumar</title>"
        }
      }
    ]
  }
}
UPDATE: As suggested, I deleted the existing index, created it again with the following mapping, and repeated the steps above. However, I am still able to search on the markup.
curl -XPUT "http://localhost:9200/foo" -d '
{
  "foo": {
    "settings": {
      "analysis": {
        "analyzer": {
          "html_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "stop",
              "asciifolding"
            ],
            "char_filter": [
              "html_strip"
            ]
          },
          "whitespace_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "standard",
              "lowercase",
              "stop",
              "asciifolding"
            ]
          }
        }
      }
    },
    "mappings": {
      "test": {
        "properties": {
          "content": {
            "type": "string",
            "index_analyzer": "html_analyzer",
            "search_analyzer": "whitespace_analyzer"
          }
        }
      }
    }
  }
}'
You need to define the analyzer in the mapping before you index anything. This ensures that every indexed document passes through that analyzer, so all tags are stripped out before the terms reach the index. In your case, you applied the analyzer while querying, which only affects your search phrase and not the data you search.
You can read more on creating a mapping here
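The difference between index-time and query-time analysis can be illustrated with a small toy model in Python. This is only a sketch, not how Elasticsearch is implemented: the regular expressions are rough stand-ins for the html_strip char filter and the standard tokenizer, and the names `analyze`, `build_index` and `search` are hypothetical helpers made up for this example.

```python
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")  # crude stand-in for the html_strip char filter
WORD_RE = re.compile(r"\w+")     # crude stand-in for the standard tokenizer


def analyze(text, strip_html):
    """Return lowercase tokens, optionally removing HTML tags first."""
    if strip_html:
        text = TAG_RE.sub(" ", text)
    return [token.lower() for token in WORD_RE.findall(text)]


def build_index(docs, strip_html):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text, strip_html):
            index[token].add(doc_id)
    return index


def search(index, query, strip_html):
    """Return ids of documents matching any token of the analyzed query."""
    hits = set()
    for token in analyze(query, strip_html):
        hits |= index.get(token, set())
    return hits


docs = {"1": "<title>Dilip Kumar</title>"}

# Analyzer applied only at query time: the tag name was already indexed.
plain_index = build_index(docs, strip_html=False)
print(search(plain_index, "title", strip_html=True))     # {'1'}

# Analyzer applied at index time: the tag never reaches the index.
stripped_index = build_index(docs, strip_html=True)
print(search(stripped_index, "title", strip_html=True))  # set()
print(search(stripped_index, "kumar", strip_html=True))  # {'1'}
```

In the first case the word "title" is already a term in the index, so no amount of filtering on the query side can hide it. Only stripping at index time keeps it out.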
I don't believe there is a format like this:
http://localhost:9200/foo/test/_search?tokenizer=standard&char_filters=html_strip&q=title
Rather, if you set the analyzer as follows, it should work fine:
curl -XPUT "http://localhost:9200/foo" -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "asciifolding"
          ],
          "char_filter": [
            "html_strip"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "html_analyzer"
        }
      }
    }
  }
}'
Here I made the analyzer common to both indexing and searching.
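To see why a shared analyzer helps, here is a rough Python sketch that compares the two analyzers on the query side. It is a toy approximation under stated assumptions: `html_analyzer` and `whitespace_analyzer` below are hypothetical functions that only mimic tag stripping, tokenizing and lowercasing (the stop and asciifolding filters are left out), and `matches` is a made-up helper.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude stand-in for the html_strip char filter


def html_analyzer(text):
    """Toy html_analyzer: strip tags, split on word characters, lowercase."""
    return [t.lower() for t in re.findall(r"\w+", TAG_RE.sub(" ", text))]


def whitespace_analyzer(text):
    """Toy whitespace_analyzer: split on whitespace only, lowercase."""
    return [t.lower() for t in text.split()]


# Terms stored in the index when the document is analyzed with html_analyzer.
indexed_terms = set(html_analyzer("<title>Dilip Kumar</title>"))


def matches(query, search_analyzer):
    """True if any analyzed query term equals an indexed term."""
    return any(term in indexed_terms for term in search_analyzer(query))


print(matches("<b>Kumar</b>", html_analyzer))        # True
print(matches("<b>Kumar</b>", whitespace_analyzer))  # False
```

With the same analyzer on both sides, the query is reduced to the same kind of terms that were stored, so markup in the query does not prevent a match.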