Finding most similar arrays of integers in Elasticsearch


Question

REWRITTEN:

In my project I have images. Each image has 5 tags from the range [1,10]. I use Elasticsearch to store these tags:

I have loaded these documents into Elasticsearch, in the index "my_project" with the type "img":

curl -XPUT 'http://localhost:9200/my_project/img/1' -d '
 {"tags": [1,4,6,7,9]}
'

Other example documents I uploaded:

{"tags": [1,4,6,7]}
{"tags": [2,3,5,6]}
{"tags": [1,2,3,8]}

In my application the vectors are much longer, but each has a fixed number of unique elements. I have about 20M of these documents.

Now I want to find similar documents for a given vector. Two vectors are more similar when they have more tags in common. For example, I want to find the most similar document for the integer vector [1,2,3,7]. The best match should be the last example document, {"tags": [1,2,3,8]}, since the two share 3 values, [1,2,3], which is more than the query shares with any other vector.
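The similarity described above can be sketched in plain Python. This is only an illustration of the intended ranking, not of how Elasticsearch scores documents; the function names are made up for this sketch, and the document list is the four example documents from the question.

```python
def shared_tag_count(query_tags, doc_tags):
    """Number of distinct tags that the two vectors have in common."""
    return len(set(query_tags) & set(doc_tags))


def rank_by_shared_tags(query_tags, docs):
    """Return the documents sorted by shared-tag count, most similar first."""
    return sorted(
        docs,
        key=lambda tags: shared_tag_count(query_tags, tags),
        reverse=True,
    )


# The four example documents from the question.
docs = [[1, 4, 6, 7, 9], [1, 4, 6, 7], [2, 3, 5, 6], [1, 2, 3, 8]]

ranked = rank_by_shared_tags([1, 2, 3, 7], docs)
print(ranked[0])  # the document sharing the most tags with the query
```

A brute-force scan like this is fine for a handful of documents but is not meant as an approach for 20M of them; it only pins down what "most similar" means here.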

So here are my problems. If I upload documents with the above curl command, I get this mapping:

{
  "my_project" : {
    "mappings" : {
      "img" : {
        "properties" : {
          "tags" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

But I think the correct mapping should use integers instead of strings. How can I define a correct explicit mapping for this type of data?

Next, I want to search documents using the similarity described above. How can I get the 100 most similar documents? If I converted these vectors into strings of whitespace-separated numbers, I could use a boolean query with should clauses for this search, but I think that using arrays of integers should be faster. Can you tell me how to construct that search query for Elasticsearch?


My solution so far

The basic solution I use now is to convert the integer array into a string. So I save documents as:

curl -XPUT 'http://localhost:9200/my_project/img/1' -d '
 {"tags": "1 4 6 7 9"}
' 

and then basically search for the string "1 2 3". While this works after a fashion, I think it would be more correct and faster to store an array of integers as an array of integers, not as a string. Is it possible to work with arrays of integers in Elasticsearch as actual arrays of integers? Or maybe my approach with strings is the best one, and I can't or don't have to use integer arrays explicitly in Elasticsearch.

Solution

I'd take a look at this discussion from last year on the Elasticsearch mailing list. Another ES user was trying to do exactly what you are trying to do: match array elements and sort by similarity. In his case the array members were "one", "two", "three" etc., but it's pretty much identical:

http://elasticsearch-users.115913.n3.nabble.com/Similarity-score-in-array-td4041674.html

The problem, as noted in the discussion, is that nothing is going to get you exactly what you want out of the box. Your approach of using the array members (either string or integer, I think both will be fine) will get you close, but the results will likely differ somewhat from what you are trying to achieve. The reason is that the default similarity scoring mechanism in Elasticsearch (and Lucene/Solr too) is TF/IDF: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/relevance-intro.html

TF/IDF may be quite close and depending upon use case may give you the same results, but won't be guaranteed to do so. A tag that appears very frequently (let's say "1" had twice the frequency of "2") will change the weighting for each term such that you may not get exactly what you are looking for.
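To make the weighting effect concrete, here is a deliberately simplified sketch. It uses the inverse document frequency formula from Lucene's classic TF/IDF similarity, `1 + ln(numDocs / (docFreq + 1))`, and just sums it over the shared tags; norms, the coordination factor and query normalization are ignored, so the numbers are not real Elasticsearch scores. The corpus size and tag frequencies are invented for the illustration.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Inverse document frequency as in Lucene's classic TF/IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def simplified_score(query_tags, doc_tags, doc_freqs, num_docs):
    """Sum of idf over shared tags (a simplification, not a real ES score)."""
    shared = set(query_tags) & set(doc_tags)
    return sum(classic_idf(doc_freqs[tag], num_docs) for tag in shared)


# Hypothetical corpus: tags 1, 2, 3 are very common, tag 7 is rare.
NUM_DOCS = 1000
DOC_FREQS = {1: 900, 2: 900, 3: 900, 7: 5}

query = [1, 2, 3, 7]
doc_a = [1, 2, 3, 8]  # shares three common tags with the query
doc_b = [4, 5, 6, 7]  # shares only the one rare tag

score_a = simplified_score(query, doc_a, DOC_FREQS, NUM_DOCS)
score_b = simplified_score(query, doc_b, DOC_FREQS, NUM_DOCS)
print(score_a, score_b)
```

Under these made-up frequencies the document with a single rare shared tag scores higher than the one with three common shared tags, which is the kind of deviation from a pure "count the shared tags" ranking described above.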

If you need your exact scoring/similarity algorithm, I believe you will need a custom score. As you've discovered, a custom scoring script will not scale well: the script is run for each and every document, so it's not too fast to begin with, and response time will grow linearly with the number of documents.
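One alternative worth trying, which is not part of the linked discussion and is offered only as a sketch: keep the tags as integers through an explicit mapping, and build a bool query whose should clauses are each wrapped in `constant_score`, so that every matching tag contributes the same fixed amount regardless of how frequent it is. The Python below only builds the request bodies as dictionaries; the helper name is hypothetical, the mapping uses the typed layout from the question, and you would need to verify the resulting scores against your own Elasticsearch version (older versions may additionally rescale bool scores, which should not change the ordering but is an assumption to check).

```python
import json

# Hypothetical explicit mapping: tags stored as integers, using the
# "img" type from the question (typed mappings as in older ES versions).
INDEX_BODY = {
    "mappings": {
        "img": {
            "properties": {
                "tags": {"type": "integer"}
            }
        }
    }
}


def build_shared_tag_query(tags, size=100):
    """Bool query in which each matching tag adds a constant 1.0 to the score."""
    should = [
        {"constant_score": {"filter": {"term": {"tags": tag}}, "boost": 1.0}}
        for tag in tags
    ]
    return {"size": size, "query": {"bool": {"should": should}}}


query_body = build_shared_tag_query([1, 2, 3, 7])
print(json.dumps(query_body, indent=2))
```

The index body would be sent when creating the index and the query body to the index's `_search` endpoint. Because no script runs per document, this avoids the scaling concern above, at the cost of supporting only the simple shared-tag count.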

Personally I'd probably experiment with some of the similarity modules that Elasticsearch provides, like BM25:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-similarity.html
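As a starting point for that experiment, a field can be given a non-default similarity in its mapping. The sketch below builds such a mapping as a Python dictionary; it assumes the string-based tags field from the question and the typed mapping layout of older Elasticsearch versions, and the helper name is made up. Treat it as something to adapt and test, not as a recommended configuration.

```python
import json


def bm25_mapping(type_name, field_name):
    """Mapping body that asks for BM25 scoring on one string field."""
    return {
        "mappings": {
            type_name: {
                "properties": {
                    field_name: {"type": "string", "similarity": "BM25"}
                }
            }
        }
    }


mapping_body = bm25_mapping("img", "tags")
print(json.dumps(mapping_body, indent=2))
```

The similarity of an existing field generally cannot be switched in place, so trying this out means creating a new index with the mapping and reindexing the documents.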
