Complexity of a Lucene search
Question

If I write an algorithm that performs a search using Lucene, how can I state its computational complexity? I know that Lucene uses tf-idf scoring, but I don't know how it is implemented. I've found that tf-idf has the following complexity:
O(|D| + |T|)
where D is the set of documents and T is the set of all terms.
However, I need someone to check whether this is correct and explain why.
Thanks
Answer
Lucene basically uses a Vector Space Model (VSM) with a tf-idf weighting scheme. So, in the standard setting we have:
- a set of documents, each represented as a vector
- a free-text query, also represented as a vector
We determine the K documents of the collection with the highest vector-space scores for the query q. Typically, we seek these top K documents ordered by decreasing score; for instance, many search engines use K = 10 to retrieve and rank the first page of the ten best results.
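As a toy sketch of this ranking step (not Lucene's actual implementation; the corpus, whitespace tokenization, and raw-count weighting below are simplified assumptions), one can build tf-idf vectors for a tiny collection and keep the K highest-scoring documents:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build one tf-idf weight vector (term -> weight) per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))  # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d.split())  # term frequency in this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def top_k(docs, query, k):
    """Score every document against the query (dot product of weight
    vectors) and return the indices of the k best-scoring documents."""
    q_terms = Counter(query.split())
    scored = []
    for i, v in enumerate(tf_idf_vectors(docs)):
        score = sum(w * v.get(t, 0.0) for t, w in q_terms.items())
        scored.append((score, i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

docs = ["lucene search engine", "vector space model", "search engine ranking"]
print(top_k(docs, "search engine", 2))
```

Note that this scores every document against the full query; the postings-list algorithm below avoids touching documents that share no term with the query.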
The basic algorithm for computing vector space scores is:
float Scores[N] = 0
Initialize Length[N]
for each query term t
    do calculate w(t,q) and fetch the postings list for t (stored in the index)
    for each pair (d, tf(t,d)) in the postings list
        do Scores[d] += wf(t,d) × w(t,q)    (dot product)
Read the array Length[d]
for each d
    do Scores[d] = Scores[d] / Length[d]
return Top K components of Scores[]
where:

- The array Length holds the lengths (normalization factors) for each of the N documents, whereas the array Scores holds the scores for each of the documents.
- tf(t,d) is the term frequency of term t in document d.
- w(t,q) is the weight in the submitted query for a given term. Note that the query is treated as a bag of words, so its vector of weights can be used as if it were another document.
- wf(t,d) is the logarithmic term weighting for query and document.
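The pseudocode above maps almost line by line onto a sketch in Python (the in-memory `postings` dict, mapping each term to `(doc_id, wf)` pairs, and the precomputed `lengths` array are hypothetical stand-ins for what Lucene stores in its index):

```python
from heapq import nlargest

def vsm_search(postings, lengths, query_weights, k):
    """Term-at-a-time scoring: accumulate partial dot products per document,
    normalize by document length, and return the top-k (doc_id, score) pairs."""
    scores = [0.0] * len(lengths)             # float Scores[N] = 0
    for t, w_tq in query_weights.items():     # for each query term t
        for d, wf_td in postings.get(t, []):  # postings list for t
            scores[d] += wf_td * w_tq         # accumulate the dot product
    for d in range(len(scores)):
        scores[d] /= lengths[d]               # normalize by Length[d]
    return nlargest(k, enumerate(scores), key=lambda pair: pair[1])

postings = {"search": [(0, 2.0), (2, 1.0)], "model": [(1, 1.0)]}
lengths = [2.0, 1.0, 1.0]
print(vsm_search(postings, lengths, {"search": 1.0}, 2))  # [(0, 1.0), (2, 1.0)]
```

Only the postings lists of the query terms are traversed, which is why a short query is cheap even over a large collection.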
As described in Complexity of vector dot-product, a vector dot product is O(n). Here the dimension is the number of terms in our vocabulary, |T|, where T is the set of terms.
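For reference, a dense dot product visits each of the n vector components exactly once, which is where the O(n) comes from (here n would be |T|):

```python
def dot(u, v):
    """Dot product of two equal-length dense vectors:
    one multiply-add per component, hence O(n)."""
    assert len(u) == len(v)
    return sum(a * b for a, b in zip(u, v))

print(dot([1.0, 2.0, 0.0], [0.5, 1.0, 4.0]))  # 2.5
```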
So, the time complexity of this algorithm is:
O(|Q| · |D| · |T|) = O(|D| · |T|)
where we consider |Q| fixed, Q being the set of words in the query (whose average size is low; on average a query has between 2 and 3 terms), and D being the set of all documents.
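To make the bound concrete, one can count the accumulate steps the scoring loop performs: one per (query term, posting) pair, so at most |Q| · |D| when every query term occurs in every document (a rough sketch that ignores constant factors and the final normalization pass):

```python
def scoring_operations(postings, query_terms):
    """Count the Scores[d] += ... steps the term-at-a-time loop would execute:
    the total length of the postings lists of the query terms."""
    return sum(len(postings.get(t, [])) for t in query_terms)

# Hypothetical worst case: each of |Q| = 3 terms occurs in all |D| = 1000 documents.
D = 1000
postings = {f"t{i}": [(d, 1.0) for d in range(D)] for i in range(3)}
print(scoring_operations(postings, ["t0", "t1", "t2"]))  # 3000
```

In practice postings lists are much shorter than |D|, so the real cost is far below this worst-case product.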
However, for a search these sets are bounded, and indexes don't tend to grow very often. As a result, searches using the VSM are really fast (when T and D are very large, search becomes slow and one has to find an alternative approach).