Finding most similar arrays of integers in Elasticsearch

Views: 56 | Published: 2021/12/13 12:09:39 | Tag: elasticsearch


Problem description

Rewritten:

In my project I have images. Each image has 5 tags from the range [1,10]. I use Elasticsearch to store these tags.

I have these documents loaded into Elasticsearch in the index "my_project" with the type "img":

curl -XPUT 'http://localhost:9200/my_project/img/1' -d '
 {"tags": [1,4,6,7,9]}
'

Other example documents I upload:

{"tags": [1,4,6,7]}
{"tags": [2,3,5,6]}
{"tags": [1,2,3,8]}

In my application the vectors are much longer, but each has a fixed number of unique elements. I have about 20M of these documents.

Now I want to find similar documents for a given vector. Vectors are more similar when they have more tags in common. For example, I want to find the most similar document for the integer vector [1,2,3,7]. The best match should be the last example document, {"tags": [1,2,3,8]}, since it shares 3 values with the query, [1,2,3], which is more than any other document shares.
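To make the similarity measure concrete, here is a minimal client-side sketch in Python. It only restates the "count of shared tags" rule on the four example documents; the function name `shared_tag_count` is made up for illustration, and nothing here talks to Elasticsearch.

```python
def shared_tag_count(query_tags, doc_tags):
    """Number of distinct tags present in both lists."""
    return len(set(query_tags) & set(doc_tags))


# The four example documents from the question.
docs = [
    [1, 4, 6, 7, 9],
    [1, 4, 6, 7],
    [2, 3, 5, 6],
    [1, 2, 3, 8],
]
query = [1, 2, 3, 7]

# Rank documents by how many tags they share with the query, best first.
ranked = sorted(docs, key=lambda d: shared_tag_count(query, d), reverse=True)
scores = [shared_tag_count(query, d) for d in docs]

print(scores)     # shared-tag count per document, in upload order
print(ranked[0])  # the best match
```

With these inputs the last document shares three tags with the query and the other three share two each, which matches the expectation stated above.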

So here are my problems. If I upload documents with the curl command above, I get this mapping:

{
  "my_project" : {
    "mappings" : {
      "img" : {
        "properties" : {
          "tags" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

But I think the correct mapping should use integers instead of strings. How can I define a correct explicit mapping for this kind of data?
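One possible shape for such a mapping is sketched below in Python, purely as an illustration. Elasticsearch has no dedicated array type: a field mapped as "integer" accepts either a single value or an array of integers. The sketch assumes an older Elasticsearch version that still uses mapping types (the "img" level, as in the question); newer versions drop that level. The helper name `build_index_body` is hypothetical, and the resulting JSON would be sent as the request body when creating the index "my_project".

```python
import json


def build_index_body(type_name="img", field="tags"):
    """Build an index-creation body that maps `field` as an integer.

    Assumption: the target cluster still uses mapping types, so the
    properties sit under the type name, as in the question's mapping.
    """
    return {
        "mappings": {
            type_name: {
                "properties": {
                    # "integer" also covers arrays of integers.
                    field: {"type": "integer"}
                }
            }
        }
    }


body = build_index_body()
print(json.dumps(body, indent=2))
```

Because the field type of an existing field cannot be changed in place, a body like this would normally be applied to a fresh index before any documents are uploaded.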

Next I want to search documents with the similarity measure described above. How can I get the 100 most similar documents? If I convert these vectors into strings of whitespace-separated numbers, I can run this search as a boolean query with should clauses, but I think using arrays of integers should be faster. How can I construct that search query for Elasticsearch?
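The boolean query with should clauses mentioned above does not require strings; it can also be written against an integer field. The Python sketch below builds such a request body with one term clause per query tag, each wrapped in constant_score so that every matching tag contributes the same amount to the score. This is a sketch under assumptions, not a verified answer: the helper name `build_overlap_query` is hypothetical, the field is assumed to be mapped as an integer, and exact scoring details can differ between Elasticsearch versions.

```python
import json


def build_overlap_query(query_tags, size=100, field="tags"):
    """Build a search body that ranks documents by shared tags.

    Each distinct tag becomes one `should` clause with a constant
    score, so a document matching more clauses should score higher.
    """
    should = [
        {"constant_score": {"filter": {"term": {field: tag}}, "boost": 1.0}}
        for tag in sorted(set(query_tags))
    ]
    return {
        "size": size,  # return the top `size` hits
        "query": {"bool": {"should": should}},
    }


search_body = build_overlap_query([1, 2, 3, 7])
print(json.dumps(search_body, indent=2))
```

The printed JSON would be sent to the index's _search endpoint. For much longer vectors the number of clauses grows with the vector length, so query cost on 20M documents would need to be measured rather than assumed.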

The basic solution I use now is to convert the integer array into a string. So I save documents as:

curl -XPUT 'http://localhost:9200/my_project/img/1' -d '
 {"tags": "1 4 6 7 9"}
'

and then basically search for the string "1 2 3". While this works to some extent, I think it would be more correct and faster to store an array of integers as an array of integers, not as a string. Is it possible to work with arrays of integers as such in Elasticsearch? Or maybe my approach with strings is the best one, and I can't, or don't have to, use integer arrays explicitly in Elasticsearch.
