Elasticsearch: Strip HTML tags before indexing docs with html_strip filter not working


Question

Given I have specified my html_strip char filter in my custom analyzer

When I index a document with HTML content

Then I expect the HTML to be stripped out of the indexed content

And on retrieval, the returned doc from the index should not contain HTML

ACTUAL: The indexed doc contained HTML. The retrieved doc contained HTML.

I have tried specifying the analyzer as index_analyzer, as one would expect, and a few others out of desperation (search_analyzer and analyzer). None seems to have any effect on the doc being indexed or retrieved.

Request: sample POST of a document with HTML content

POST /html_poc_v2/html_poc_type/02
{
  "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
}

Expected: indexed data to have been parsed through the HTML analyzer. Actual: data is indexed with HTML.

Response

{
   "_index": "html_poc_v2",
   "_type": "html_poc_type",
   "_id": "02",
   ...
   "_source": {
      "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
      "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
      "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
   }
}

Settings and document mapping

PUT /html_poc_v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    },
    "mappings": {
      "html_poc_type": {
        "properties": {
          "body": {
            "type": "string",
            "analyzer": "my_html_analyzer"
          },
          "description": {
            "type": "string",
            "analyzer": "my_html_analyzer"
          },
          "title": {
            "type": "string",
            "search_analyser": "my_html_analyzer"
          },
          "urlTitle": {
            "type": "string"
          }
        }
      }
    }
  }
}

Test to prove the custom analyzer works correctly:

Request

GET /html_poc_v2/_analyze?analyzer=my_html_analyzer
{<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>}

Response

{
   "tokens": [
      {
         "token": "Some",… "position": 1
      },
      {
         "token": "déjà",… "position": 2
      },
      {
         "token": "vu",…  "position": 3
      },
      {
         "token": "website",… "position": 4
      }
   ]
}

Under the hood

Going under the hood with an inline script further proves that my HTML analyzer must have been skipped.

Request

GET /html_poc_v2/html_poc_type/_search?pretty=true
{
  "query" : {
    "match_all" : { }
  },
  "script_fields": {
    "terms" : {
        "script": "doc[field].values",
        "params": {
            "field": "title"
        }
    }
  }
}

Response

{ …
   "hits": { ..
      "hits": [
         {
            "_index": "html_poc_v2",
            "_type": "html_poc_type",
            …
            "fields": {
               "terms": [
                  [
                     "a",
                     "agrave",
                     "d",
                     "eacute",
                     "href",
                     "http",
                     "j",
                     "p",
                     "some",
                     "somedomain.com",
                     "title",
                     "vu",
                     "website"
                  ]
               ]
            }
         }
      ]
   }
}

Similar to this question. I have also read this amazing doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

ES version: 1.7.2

Please help.

Answer

You are confusing the "_source" field in the response with what is being analyzed and indexed. It looks like your expectation is that the _source field in the response returns the analyzed document. This is incorrect.

From the documentation:

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.

Ideally, in the above case where you want to format the source data for presentation purposes, it should be done on the client side.
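As a rough sketch of that client-side approach, here is a Python stdlib-only example (the strip_html helper is illustrative, not part of any Elasticsearch client). Note that it only approximates html_strip: the real char filter also replaces block-level tags like <p> with newlines, which this sketch does not do.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data while skipping tags; entity references
    such as &eacute; are decoded automatically by convert_charrefs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    # Drop tags and decode HTML entities, keeping only text content.
    # Quoted attribute values (e.g. href="...>") are handled correctly,
    # unlike a naive regex such as <[^>]+>.
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return "".join(parser.chunks)


source = ('Description <p>Some d&eacute;j&agrave; vu '
          '<a href="http://somedomain.com>">website</a>')
print(strip_html(source))  # Description Some déjà vu website
```

Doing this on the client also leaves the stored _source untouched, so the original markup remains available if you ever need it.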

That being said, one way to achieve it for the above use case is using script fields and the keyword tokenizer, as follows:

PUT test
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_html_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "char_filter": [
                  "html_strip"
               ]
            },
            "parsed_analyzer": {
               "type": "custom",
               "tokenizer": "keyword",
               "char_filter": [
                  "html_strip"
               ]
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "body": {
               "type": "string",
               "analyzer": "my_html_analyzer",
               "fields": {
                  "parsed": {
                     "type": "string",
                     "analyzer": "parsed_analyzer"
                  }
               }
            }
         }
      }
   }
}


PUT test/test/1 
{
    "body" : "Title <p> Some d&eacute;j&agrave; vu <a href='http://somedomain.com'> website </a> <span> this is inline </span></p> "
}

GET test/_search
{
  "query" : {
    "match_all" : { }
  },
  "script_fields": {
    "terms" : {
        "script": "doc[field].values",
        "params": {
            "field": "body.parsed"
        }
    }
  }
}

Result:

{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_score": 1,
   "fields": {
      "terms": [
         "Title \n Some déjà vu  website   this is inline \n "
      ]
   }
}
Note: I believe the above is a bad idea, since stripping the HTML tags could easily be achieved on the client side, and you would have much more control over formatting than with a workaround such as this. More importantly, it may be more performant to do it on the client side.
