Is it possible to query Elastic Search with a feature vector?


Question

I'd like to store an n-dimensional feature vector, e.g. <1.00, 0.34, 0.22, ..., 0>, with each document, and then provide another feature vector as a query, with the results sorted in order of cosine similarity. Is this possible with Elastic Search?

Answer

I don't have an answer specific to Elastic Search because I've never used it (I use Lucene, on which Elastic Search is built). However, I can give a generic answer to your question. There are two standard ways to obtain the nearest vectors for a given query vector, described as follows.

K-d trees

The first approach is to store the vectors in memory in a data structure that supports nearest-neighbour queries, e.g. a k-d tree. A k-d tree is a generalization of the binary search tree in the sense that each level of the tree partitions one of the k dimensions into two parts. If you have enough space to load all the points into memory, you can run a nearest-neighbour search on the k-d tree to obtain a list of retrieved vectors sorted by cosine similarity. The obvious disadvantage of this method is that it does not scale to the huge collections of points often encountered in information retrieval.
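Below is a minimal sketch of this approach in Python, assuming NumPy and SciPy are available. It relies on a standard trick: for L2-normalized vectors, ranking by Euclidean distance is equivalent to ranking by cosine similarity, so a k-d tree built on normalized vectors returns neighbours in cosine order.

```python
# Minimal k-d tree sketch (assumes NumPy and SciPy are installed).
# For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b), so Euclidean
# nearest neighbours on normalized vectors are also cosine nearest neighbours.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
docs = rng.random((1000, 8))                          # 1000 docs, 8-dim features
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # L2-normalize every row

tree = cKDTree(docs)                                  # in-memory index

query = rng.random(8)
query /= np.linalg.norm(query)

dists, idx = tree.query(query, k=5)                   # 5 nearest neighbours
for i, d in zip(idx, dists):
    print(i, 1.0 - d * d / 2.0)                       # recover cosine similarity
```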

Inverted quantized vectors

The second approach is to use inverted quantized vectors. A simple range-based quantization assigns pseudo-terms, or labels, to the real numbers of a vector so that they can later be indexed by Lucene (or, for that matter, Elastic Search).

For example, we may assign the label A to the range [0, 0.1), B to the range [0.1, 0.2), and so on. The sample vector in your question is then encoded as (J, D, C, ..., A), because 1.00 falls in [0.9, 1] (J), 0.34 falls in [0.3, 0.4) (D), 0.22 falls in [0.2, 0.3) (C), and 0 falls in [0, 0.1) (A).
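A minimal sketch of such a quantizer, assuming all components lie in [0, 1] and using ten buckets labelled A through J; the dimension prefix (d0_, d1_, ...) is an assumption added here so that positional information survives when the labels are later treated as a bag of terms:

```python
# Range-based quantization sketch: maps each component in [0, 1] to one of
# ten labels A..J (bucket width 0.1). The "d<i>_" prefix is an assumption
# added here to record which dimension each label came from.
def quantize(vector, buckets=10):
    terms = []
    for dim, x in enumerate(vector):
        b = min(int(x * buckets), buckets - 1)  # clamp 1.0 into the top bucket
        terms.append(f"d{dim}_{chr(ord('A') + b)}")
    return " ".join(terms)

print(quantize([1.00, 0.34, 0.22, 0.0]))  # -> d0_J d1_D d2_C d3_A
```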

A vector of real numbers is thus transformed into a string, which can be treated as a document and indexed with a standard information retrieval (IR) tool. A query vector is transformed into a bag of pseudo-terms in the same way, and one can then retrieve the vectors in the collection most similar (in terms of cosine similarity or another measure) to the query.
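As a hedged illustration of the indexing step, the quantized string can be stored in an ordinary text field and queried with a match query, so that documents sharing more pseudo-terms score higher. The sketch below assumes the official elasticsearch-py client (8.x-style API) and reuses the quantize helper from the previous snippet; the index name vectors and field name vec_terms are illustrative, not prescribed by the answer.

```python
# Hedged Elasticsearch sketch (assumes elasticsearch-py 8.x and a local node;
# "vectors" and "vec_terms" are illustrative names, not fixed by the answer).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document whose quantized vector is stored as plain text.
es.index(index="vectors", id="doc1",
         document={"vec_terms": quantize([1.00, 0.34, 0.22, 0.0])})

# Query with the quantized query vector: overlap in pseudo-terms drives the
# score, which approximates a cosine-similarity ranking.
resp = es.search(index="vectors",
                 query={"match": {"vec_terms": quantize([0.95, 0.30, 0.25, 0.05])}})
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```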

The main advantage of this method is that it scales well to massive collections of real-valued vectors. The key disadvantage is that the computed similarity values are only approximations of the true cosine similarities (because of the loss incurred in quantization). A smaller quantization range yields a better approximation at the cost of a larger index.

