Complexity of a Lucene search
Question

If I write an algorithm that performs a search using Lucene, how can I state its computational complexity? I know that Lucene uses tf-idf scoring, but I don't know how it is implemented. I've found that tf-idf has the following complexity:
O(|D| + |T|)
where D is the set of documents and T is the set of all terms.
However, I need someone to check whether this is correct and explain why.
Thanks
Answer
Lucene basically uses a Vector Space Model (VSM) with a tf-idf weighting scheme. So, in the standard setting we have:
- a set of documents, each represented as a vector
- a free-text query, also represented as a vector
We determine the K documents of the collection with the highest vector-space scores for the query q. Typically, we seek these top K documents ordered by decreasing score; for instance, many search engines use K = 10 to retrieve and rank the first page of the ten best results.
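As a toy sketch of this ranking step (not Lucene's actual implementation; the corpus, whitespace tokenization, and raw-count weighting below are simplified assumptions), one can build tf-idf vectors for a tiny collection and keep the K highest-scoring documents:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build one tf-idf weight vector (term -> weight) per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))  # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d.split())  # term frequency in this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def top_k(docs, query, k):
    """Score every document against the query (dot product of weight
    vectors) and return the indices of the k best-scoring documents."""
    q_terms = Counter(query.split())
    scored = []
    for i, v in enumerate(tf_idf_vectors(docs)):
        score = sum(w * v.get(t, 0.0) for t, w in q_terms.items())
        scored.append((score, i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

docs = ["lucene search engine", "vector space model", "search engine ranking"]
print(top_k(docs, "search engine", 2))
```

Note that this scores every document against the full query; the postings-list algorithm below avoids touching documents that share no term with the query.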
The basic algorithm for computing vector space scores is:
float Scores[N] = 0
Initialize Length[N]
for each query term t
    do calculate w(t,q) and fetch the postings list for t (stored in the index)
    for each pair (d, tf(t,d)) in the postings list
        do Scores[d] += wf(t,d) × w(t,q)    (dot product)
Read the array Length[d]
for each d
    do Scores[d] = Scores[d] / Length[d]
return Top K components of Scores[]
where:

- The array Length holds the lengths (normalization factors) for each of the N documents, whereas the array Scores holds the scores for each of the documents.
- tf(t,d) is the term frequency of term t in document d.
- w(t,q) is the weight in the submitted query for a given term. Note that the query is treated as a bag of words, so its vector of weights can be used as if it were another document.
- wf(t,d) is the logarithmic term weighting for query and document.
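The pseudocode above maps almost line by line onto a sketch in Python (the in-memory `postings` dict, mapping each term to `(doc_id, wf)` pairs, and the precomputed `lengths` array are hypothetical stand-ins for what Lucene stores in its index):

```python
from heapq import nlargest

def vsm_search(postings, lengths, query_weights, k):
    """Term-at-a-time scoring: accumulate partial dot products per document,
    normalize by document length, and return the top-k (doc_id, score) pairs."""
    scores = [0.0] * len(lengths)             # float Scores[N] = 0
    for t, w_tq in query_weights.items():     # for each query term t
        for d, wf_td in postings.get(t, []):  # postings list for t
            scores[d] += wf_td * w_tq         # accumulate the dot product
    for d in range(len(scores)):
        scores[d] /= lengths[d]               # normalize by Length[d]
    return nlargest(k, enumerate(scores), key=lambda pair: pair[1])

postings = {"search": [(0, 2.0), (2, 1.0)], "model": [(1, 1.0)]}
lengths = [2.0, 1.0, 1.0]
print(vsm_search(postings, lengths, {"search": 1.0}, 2))  # [(0, 1.0), (2, 1.0)]
```

Only the postings lists of the query terms are traversed, which is why a short query is cheap even over a large collection.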
As described in Complexity of vector dot-product, a vector dot product is O(n). Here the dimension is the number of terms in our vocabulary, |T|, where T is the set of terms.
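For reference, a dense dot product visits each of the n vector components exactly once, which is where the O(n) comes from (here n would be |T|):

```python
def dot(u, v):
    """Dot product of two equal-length dense vectors:
    one multiply-add per component, hence O(n)."""
    assert len(u) == len(v)
    return sum(a * b for a, b in zip(u, v))

print(dot([1.0, 2.0, 0.0], [0.5, 1.0, 4.0]))  # 2.5
```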
So, the time complexity of this algorithm is:
O(|Q| · |D| · |T|) = O(|D| · |T|)
where we consider |Q| fixed, Q being the set of words in the query (whose average size is low; on average a query has between 2 and 3 terms), and D being the set of all documents.
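To make the bound concrete, one can count the accumulate steps the scoring loop performs: one per (query term, posting) pair, so at most |Q| · |D| when every query term occurs in every document (a rough sketch that ignores constant factors and the final normalization pass):

```python
def scoring_operations(postings, query_terms):
    """Count the Scores[d] += ... steps the term-at-a-time loop would execute:
    the total length of the postings lists of the query terms."""
    return sum(len(postings.get(t, [])) for t in query_terms)

# Hypothetical worst case: each of |Q| = 3 terms occurs in all |D| = 1000 documents.
D = 1000
postings = {f"t{i}": [(d, 1.0) for d in range(D)] for i in range(3)}
print(scoring_operations(postings, ["t0", "t1", "t2"]))  # 3000
```

In practice postings lists are much shorter than |D|, so the real cost is far below this worst-case product.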
However, for a search these sets are bounded, and indexes don't tend to grow very often. As a result, searches using the VSM are really fast (when T and D are very large, search becomes slow and one has to find an alternative approach).