Check Elasticsearch document similarity before indexing


Question

OK, after pulling my hair out all day trying to figure this one out, I decided to get some input from the community.

I should mention that I'm fairly new to Elasticsearch.

The idea is that I have an ES index containing some documents, and I need to index a new document only if no existing document with similar (but not necessarily equal) field content is already indexed.

I can perform a match query on multiple fields and get a global score for the query, but since that score is not a percentage of the maximum possible score, I'm not sure how to set a threshold to decide whether I can insert the document or not.

I am obviously a bit confused about the ES scoring system. Thanks in advance for all the help I can get on this.

As a basic example:

Already indexed:

{
  "title": "My first blog entry",
  "text":  "Just trying this out...",
  "date":  "2014/01/01"
}

This is new but should not be indexed, since the fields are not equal but are too similar:

{
  "title": "My first blog entries",
  "text":  "Just trying it out...",
  "date":  "2014/01/01"
}

This is new and should be indexed:

{
  "title": "My second entry for this blog",
  "text":  "I am just trying out a few things",
  "date":  "2014/01/01"
}

So basically, what I'm after is deduplication prior to indexing, based on field similarity :)

Recommended answer

In such a query, you can provide artificial documents in the like field, which will be matched against the documents in your index for similarity. By default all available fields are used, but you can also restrict the comparison to a limited set of fields.
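
The like field and artificial documents described here correspond to Elasticsearch's more_like_this query. Assuming that is the query meant, a request body for the example in the question could be built roughly as follows. The helper name build_mlt_query, the index name "blog", and the lowered min_term_freq / min_doc_freq values are illustrative assumptions, not part of the original answer.

```python
import json


def build_mlt_query(doc, fields, index="blog"):
    """Build a search body that uses `doc` as an artificial document.

    `index` is a hypothetical index name. Only the listed `fields` are
    compared, so a field such as "date" can be left out of the similarity check.
    """
    artificial_doc = {field: doc[field] for field in fields}
    return {
        "size": 1,  # only the best match matters for a duplicate check
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": [{"_index": index, "doc": artificial_doc}],
                # Assumption: the defaults (2 and 5) would ignore most terms
                # in short texts and small indices, so they are lowered here.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    }


candidate = {
    "title": "My first blog entries",
    "text": "Just trying it out...",
    "date": "2014/01/01",
}
body = build_mlt_query(candidate, ["title", "text"])
print(json.dumps(body, indent=2))
```

The resulting dictionary is what would be sent to the index's _search endpoint with whichever client you use.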

Most of the time, this query is used to retrieve documents similar to one or a few documents that the user might be looking at, or that the user has selected. Nonetheless, you can probably use this feature to analyze the score of the returned documents (if any) and decide whether to index your document or not.
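
One way to turn the returned scores into an index-or-skip decision is sketched below. Since raw scores are not normalized, the cutoff is an assumption that would have to be tuned on your own data. The function name should_index and the sample responses are hypothetical; the responses only mimic the shape of a standard search response (hits.hits[]._score).

```python
def should_index(search_response, score_threshold):
    """Return True when no existing document looks too similar.

    `score_threshold` is an assumed, data-dependent cutoff, not a value
    prescribed by Elasticsearch.
    """
    hits = search_response.get("hits", {}).get("hits", [])
    if not hits:
        # Nothing similar was returned at all, so the document is new.
        return True
    best_score = max(hit["_score"] for hit in hits)
    return best_score < score_threshold


# Hypothetical responses with made-up scores, for illustration only.
no_match = {"hits": {"hits": []}}
close_match = {"hits": {"hits": [{"_id": "1", "_score": 3.2}]}}

print(should_index(no_match, score_threshold=2.0))
print(should_index(close_match, score_threshold=2.0))
```

A reasonable way to pick the cutoff is to run the query for a handful of known duplicates and known distinct documents and choose a value that separates the two groups.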

Please refer to the documentation page linked above for a comprehensive list of parameters.
