如何修改Elasticsearch文档的_source字段 [英] How to modify _source field of an Elasticsearch document
问题描述
问题:有什么方法可以清除文档_source中的html吗?html的剥离可以是定期的,触发的,也可以在索引被索引时即时进行.
Question: Is there any way to clean the html from the _source of the doc? The stripping of the html can be periodic, triggered or ideally on the fly as its being indexed.
我要对进入Elasticsearch的数据进行索引,然后将它与分析器建立索引,该分析器会在索引被索引之前剥离不需要的htmls标签.
I have in coming data into elasticsearch being indexed against an Analyser that is stripping unwanted htmls tags before its being index.
根据查询/获取其_source字段,其中包含原始内容,该原始内容具有返回给客户端的html.
On Queries/Gets its the _source field with the original content that has the htmls that is returned to clients.
注意:
- 在传递给Elasticsearch之前,我无法清理数据,对此我无能为力.
- 我的客户从elasticsearch检索数据时,可以在呈现数据之前使用javascript进行剥离,但这两种方法都不可行.
推荐答案
_source仅存储已建立索引的JSON.您无法更改.
_source simply stores the JSON that has been indexed. You cannot change it.
现在,如果您想在索引和按原样存储内容之前完全去除html,则可以使用mapper附件插件-在其中定义映射时,可以将content_type归类为"html".
Now if you want to completely strip out the html prior to indexing and storing the content as is, you can use the mapper attachment plugin - in which when you define the mapping, you can categorize the content_type to be "html."
映射器附件在许多事情上很有用,尤其是在处理多种文档类型的情况下,但最值得注意的是-我相信仅将其用于剥离html标签就足够了(您无法使用html_strip char过滤器).
The mapper attachment is useful for many things especially if you are handling multiple document types, but most notably - I believe just using this for the purpose of stripping out the html tags is sufficient enough (which you cannot do with the html_strip char filter).
不过只是一个警告-不会存储任何html标记.因此,如果您确实需要这些标签,我建议您定义另一个字段来存储原始内容.另一个注意事项:您不能为映射器附件文档指定多字段,因此您需要将其存储在映射器附件文档之外.请参阅下面的工作示例.
Just a forewarning though - NONE of the html tags will be stored. So if you do need those tags somehow, I would suggest defining another field to store the original content. Another note: You cannot specify multifields for mapper attachment documents, so you would need to store that outside of the mapper attachment document. See my working example below.
您需要生成此映射:
{
"html5-es" : {
"aliases" : { },
"mappings" : {
"document" : {
"properties" : {
"delete" : {
"type" : "boolean"
},
"file" : {
"type" : "attachment",
"fields" : {
"content" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "autocomplete"
},
"author" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets"
},
"title" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "autocomplete"
},
"name" : {
"type" : "string"
},
"date" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
"keywords" : {
"type" : "string"
},
"content_type" : {
"type" : "string"
},
"content_length" : {
"type" : "integer"
},
"language" : {
"type" : "string"
}
}
},
"hash_id" : {
"type" : "string"
},
"path" : {
"type" : "string"
},
"raw_content" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "raw"
},
"title" : {
"type" : "string"
}
}
}
},
"settings" : { //insert your own settings here },
"warmers" : { }
}
}
在NEST中,我将这样组装内容:
Such that in NEST, I will assemble the content as such:
Attachment attachment = new Attachment();
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
attachment.ContentType = "html";
Document document = new Document();
document.File = attachment;
document.RawContent = InsertRawContentFromString(originalText);
我已经在Sense上对其进行了测试-结果如下:
I have tested this in Sense - results are as follows:
"file": {
"_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
"_content_length": 0,
"_content_type": "html",
"_date": "0001-01-01T00:00:00",
"_title": "Topic10"
},
"delete": false,
"raw_content": "<h1>Topic10</h1><p>Delete this text and replace it with your own content. Check your mailbox.</p><p> </p><p>asdf</p><p> </p><p>10</p><p> </p><p>Lavender.</p><p> </p><p>10/6 12:03</p><p> </p><p>5 09</p><p> </p><p>11 47</p><p> </p><p>Halloween is in October.</p><p> </p><p>jog</p>"
},
"highlight": {
"file.content": [
"\n <em>Topic10</em>\n\n Delete this text and replace it with your own content. Check your mailbox.\n\n \n\n asdf\n\n \n\n 10\n\n \n\n Lavender.\n\n \n\n 10/6 12:03\n\n \n\n 5 09\n\n \n\n 11 47\n\n \n\n Halloween is in October.\n\n \n\n jog\n\n "
]
}
这篇关于如何修改Elasticsearch文档的_source字段的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!