Finding most similar arrays of integers in Elasticsearch

Question

Reformulation:

In my project I have images. Each image has 5 tags from the range [1,10]. I used Elasticsearch to upload these tags:

I have these documents loaded into Elasticsearch in the index "my_project" with type "img":

curl -XPUT 'http://localhost:9200/my_project/img/1' -d '
 {"tags": [1,4,6,7,9]}
'

Other example documents I upload:

{"tags": [1,4,6,7]}
{"tags": [2,3,5,6]}
{"tags": [1,2,3,8]}

In my application the vectors are much longer, but each has a fixed number of unique elements. I have about 20M of these documents.

Now I want to find similar documents for a given vector. Vectors are more similar when they have more tags in common. For example, I want to find the most similar document for the integer vector [1,2,3,7]. The best match should be the last example document, {"tags": [1,2,3,8]}, since the two share 3 values, [1,2,3], which is more than the query shares with any other vector.
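To make the intended ranking concrete, here is a minimal Python sketch of the similarity described above: count the tags two vectors share and sort by that count. The function names `overlap` and `most_similar` are hypothetical, and the sketch runs in memory on the four example documents only; it is not an Elasticsearch query.

```python
def overlap(query, tags):
    """Number of distinct tags shared by the query and a document."""
    return len(set(query) & set(tags))


def most_similar(query, docs, k=100):
    """Return the k documents sharing the most tags with the query."""
    ranked = sorted(docs, key=lambda tags: overlap(query, tags), reverse=True)
    return ranked[:k]


# The four example documents from the question.
docs = [
    [1, 4, 6, 7, 9],
    [1, 4, 6, 7],
    [2, 3, 5, 6],
    [1, 2, 3, 8],
]

best = most_similar([1, 2, 3, 7], docs, k=1)[0]
print(best)  # [1, 2, 3, 8], which shares 3 tags with the query
```

The other three documents each share only 2 tags with [1,2,3,7], so [1,2,3,8] comes first.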

So here are my problems. If I upload documents with the curl command above, I get this mapping:

{
  "my_project" : {
    "mappings" : {
      "img" : {
        "properties" : {
          "tags" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

But I think the correct mapping should use integers instead of strings. How can I define a correct explicit mapping for this type of data?

Next, I want to search documents using the similarity described above. How can I get the 100 most similar documents? If I convert these vectors into strings of whitespace-separated numbers, I can use a boolean query with should clauses for this search, but I think that using arrays of integers should be faster. Can you tell me how to construct that search query for Elasticsearch?
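For reference, the boolean query with should clauses mentioned above could be built roughly as follows. This is a sketch only: the helper name `build_overlap_query` is hypothetical, the field name `tags` and the size of 100 come from the question, and the body is assumed to be sent to the `_search` endpoint of the `my_project` index. Hits are ordered by Elasticsearch's default relevance score, which is not guaranteed to equal the number of shared tags.

```python
import json


def build_overlap_query(query_tags, size=100, field="tags"):
    """Build a bool query with one `term` should-clause per query tag."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Each matching clause contributes to the document's score.
                "should": [{"term": {field: tag}} for tag in query_tags]
            }
        },
    }


body = build_overlap_query([1, 2, 3, 7])
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the request body of a search request, for example with curl's `-d` option as in the indexing commands above.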

The basic solution I use now is to convert the integer array into a string. So I save documents as:

curl -XPUT 'http://localhost:9200/my_project/img/1' -d '
 {"tags": "1 4 6 7 9"}
' 

and then basically search for the string "1 2 3". While this works after a fashion, I think it would be more correct and faster to store an array of integers as an array of integers, not as a string. Is it possible to work with arrays of integers in Elasticsearch as actual arrays of integers? Or maybe my approach with strings is the best one and I can't, or don't have to, use integer arrays explicitly in Elasticsearch.

Recommended answer

I'd take a look at this discussion from last year on the Elasticsearch mailing list. Another ES user was trying to do exactly what you are trying to do: match array elements and sort by similarity. In his case the array members were "one", "two", "three" and so on, but the problem is pretty much identical:

http://elasticsearch-users.115913.n3.nabble.com/Similarity-score-in-array-td4041674.html

The problem, as noted in that discussion, is that nothing will give you exactly what you want out of the box. Your approach of using the array members (either string or integer, I think both will be fine) will get you close, but the results will likely differ somewhat from what you are trying to achieve. The reason is that the default similarity scoring mechanism in Elasticsearch (and in Lucene/Solr too) is TF/IDF: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/relevance-intro.html

TF/IDF may come quite close and, depending on the use case, may give you the same results, but it is not guaranteed to. Tags that appear with different frequencies (say "1" occurs twice as often as "2") get different weights, so a document's score no longer depends only on how many tags it shares with the query, and you may not get exactly the ranking you are looking for.
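A toy illustration of that effect, under loudly stated assumptions: the score below is just the sum of an IDF weight over the shared tags, with IDF taken as 1 + ln(N / (df + 1)). It is a simplification for illustration, not Lucene's full scoring formula, and the corpus and the names `idf` and `toy_score` are made up for the example.

```python
import math


def idf(tag, docs):
    """Simplified inverse document frequency: 1 + ln(N / (df + 1))."""
    df = sum(1 for tags in docs if tag in tags)
    return 1.0 + math.log(len(docs) / (df + 1))


def toy_score(query, tags, docs):
    """Sum of IDF weights over the tags shared by query and document."""
    return sum(idf(tag, docs) for tag in set(query) & set(tags))


# Hypothetical corpus: tags 1, 2, 3 are very common, tag 7 is rare.
corpus = [[1, 2, 3, 8]] * 98 + [[4, 5, 6, 7]]
query = [1, 2, 3, 7]

common_doc = [1, 2, 3, 8]  # shares 3 common tags with the query
rare_doc = [4, 5, 6, 7]    # shares only the single rare tag 7

print(toy_score(query, common_doc, corpus))
print(toy_score(query, rare_doc, corpus))
```

In this made-up corpus the document sharing one rare tag gets a higher toy score than the document sharing three common tags, which is the opposite of the ordering by shared-tag count.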

If you need your exact scoring/similarity algorithm, I believe you will need to use custom scoring. As you've discovered, a custom scoring script will not scale well: the script runs for each and every document, so it is not very fast to begin with, and response time grows linearly with the number of documents.

Personally, I'd probably experiment with some of the similarity modules that Elasticsearch provides, such as BM25:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-similarity.html
