Elasticsearch: Stripping HTML tags before indexing docs with the html_strip filter is not working
Question
Given I have specified the html_strip char filter in my custom analyzer
When I index a document with HTML content
Then I expect the HTML to be stripped out of the indexed content
And on retrieval the document returned from the index should not contain HTML
ACTUAL: The indexed doc contained HTML. The retrieved doc contained HTML.
I have tried specifying the analyzer as index_analyzer, as one would expect, and, out of desperation, as search_analyzer and analyzer. None of them seems to have any effect on the document being indexed or retrieved.
Request: example POST of a document with HTML content
POST /html_poc_v2/html_poc_type/02
{
"description": "Description <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
"title": "Title <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
"body": "Body <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>"
}
Expected: the indexed data has been parsed by the HTML analyzer. Actual: the data is indexed with the HTML intact.
Response
{
"_index": "html_poc_v2", "_type": "html_poc_type", "_id": "02", ...
"_source": {
"description": "Description <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
"title": "Title <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>",
"body": "Body <p>Some déjà vu <a href=\"http://somedomain.com>\">website</a>"
}
}
Settings and document mappings
PUT /html_poc_v2
{
"settings": {
"analysis": {
"analyzer": {
"my_html_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"html_strip"
]
}
}
},
"mappings": {
"html_poc_type": {
"properties": {
"body": {
"type": "string",
"analyzer": "my_html_analyzer"
},
"description": {
"type": "string",
"analyzer": "my_html_analyzer"
},
"title": {
"type": "string",
"search_analyser": "my_html_analyzer"
},
"urlTitle": {
"type": "string"
}
}
}
}
}
}
Test to prove that the custom analyzer works:
Request
GET /html_poc_v2/_analyze?analyzer=my_html_analyzer
{<p>Some déjà vu <a href="http://somedomain.com>">website</a>}
Response
{
"tokens": [
{
"token": "Some",… "position": 1
},
{
"token": "déjà",… "position": 2
},
{
"token": "vu",… "position": 3
},
{
"token": "website",… "position": 4
}
]
}
Under the hood
Going under the hood with an inline script further proves that my HTML analyzer must have been skipped.
Request
GET /html_poc_v2/html_poc_type/_search?pretty=true
{
"query" : {
"match_all" : { }
},
"script_fields": {
"terms" : {
"script": "doc[field].values",
"params": {
"field": "title"
}
}
}
}
Response
{ …
"hits": { ..
"hits": [
{
"_index": "html_poc_v2",
"_type": "html_poc_type",
…
"fields": {
"terms": [
[
"a",
"agrave",
"d",
"eacute",
"href",
"http",
"j",
"p",
"some",
"somedomain.com",
"title",
"vu",
"website"
]
]
}
}
]
}
}
I have also read this amazing doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
ES version: 1.7.2
Please help.
Recommended answer
You are confusing the "_source" field in the response with what is actually analyzed and indexed.
It looks like you expect the _source field in the response to return the analyzed document. This is incorrect: the analyzer only affects the terms written to the index, not the stored document.
From the documentation:
The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.
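The distinction quoted above can be illustrated with a toy model. This is only a sketch in Python, not Elasticsearch code: ToyIndex and analyze are hypothetical names, and the standard library's html.parser stands in for the html_strip char filter. It shows that analysis shapes the indexed terms while the stored source stays exactly as submitted.

```python
from html.parser import HTMLParser
import re


class _TextOnly(HTMLParser):
    """Collects text nodes and drops tags (a stand-in for html_strip)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def analyze(text):
    """Toy analyzer: strip HTML, then emit lowercase word tokens."""
    parser = _TextOnly()
    parser.feed(text)
    parser.close()
    return [t.lower() for t in re.findall(r"\w+", "".join(parser.parts))]


class ToyIndex:
    """Hypothetical in-memory index; an illustration, not Elasticsearch internals."""

    def __init__(self):
        self.sources = {}   # doc id -> original document, kept verbatim
        self.postings = {}  # term -> set of doc ids containing it

    def index(self, doc_id, doc):
        self.sources[doc_id] = doc  # stored untouched, like _source
        for value in doc.values():
            for term in analyze(value):  # only the analyzed terms are indexed
                self.postings.setdefault(term, set()).add(doc_id)

    def get(self, doc_id):
        return self.sources[doc_id]


idx = ToyIndex()
raw = 'Body <p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com">website</a>'
idx.index("02", {"body": raw})

print(sorted(idx.postings))   # terms only, no "p", "a" or "href"
print(idx.get("02")["body"])  # the original markup, unchanged
```

In this model, as in the question, fetching the document always returns the markup, no matter which analyzer produced the terms.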
Ideally, in a case like the one above, where you want to format the source data for presentation purposes, the formatting should be done on the client side.
That being said, one way to achieve it for the above use case is to use script fields and the keyword tokenizer, as follows:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_html_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"html_strip"
]
},
"parsed_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"char_filter": [
"html_strip"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"body": {
"type": "string",
"analyzer": "my_html_analyzer",
"fields": {
"parsed": {
"type": "string",
"analyzer": "parsed_analyzer"
}
}
}
}
}
}
}
PUT test/test/1
{
"body" : "Title <p> Some déjà vu <a href='http://somedomain.com'> website </a> <span> this is inline </span></p> "
}
GET test/_search
{
"query" : {
"match_all" : { }
},
"script_fields": {
"terms" : {
"script": "doc[field].values",
"params": {
"field": "body.parsed"
}
}
}
}
Result:
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1,
"fields": {
"terms": [
"Title \n Some déjà vu website this is inline \n "
]
}
}
Note: I believe the above is a bad idea, since stripping the HTML tags can easily be achieved on the client side, where you would have much more control over formatting than with a workaround such as this. More importantly, doing it on the client side may well perform better.
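For the client-side route, a minimal sketch in Python could look like the following. It assumes the client already holds the _source of a hit as a dict; strip_html and strip_source are hypothetical helper names, and only the standard library's html.parser is used. Note that this simple parser keeps the text inside script and style elements, so real-world markup may need extra handling.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keeps text nodes, discards tags, and decodes character references."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse the whitespace left behind where tags used to be.
        return " ".join("".join(self._chunks).split())


def strip_html(markup):
    """Return the plain text of an HTML fragment."""
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()
    return stripper.text()


def strip_source(source, fields):
    """Return a copy of a hit's _source with the named fields stripped."""
    return {k: strip_html(v) if k in fields else v for k, v in source.items()}


source = {
    "body": "Title <p> Some déjà vu <a href='http://somedomain.com'> website </a> "
            "<span> this is inline </span></p> ",
    "urlTitle": "some-title",
}
print(strip_source(source, {"body"})["body"])
```

Because the stripping happens after retrieval, the index mapping needs no extra sub-field, and the client decides how whitespace and entities are presented.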