Solr TF vs All Terms match

Question

I have observed that Solr/Lucene gives too much weight to matching all the query terms compared with the term frequency (tf) of a particular query term. For example, say our query is:
text:("red" "jacket" "red jacket")
Document A -> contains "jacket" 40 times
Document B -> contains "red jacket" 1 time (and therefore "red" 1 time and "jacket" 1 time as well)

Document B gets a much higher score because it contains all three clauses of the query, even though each appears only once, whereas Document A gets a very low score even though it contains one of the terms a large number of times.

Can I write the query in such a way that, when Lucene finds a match for "red jacket", it does not also count it as a match for "red" and "jacket" individually?

Recommended answer

Tf-idf is what search engines normally use, but it is not always what you want. In particular, it is not what you want if repeated keywords should be ignored.
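
To get a feel for why one hit on every clause can outrank 40 repetitions of a single term, here is a rough Python sketch. It is not Solr's actual scoring code. It assumes the classic TF-IDF similarity found in older Lucene versions, where tf is the square root of the raw frequency and a coordination factor multiplies the score by the fraction of query clauses that matched. As further simplifying assumptions, idf is fixed at 1 for every clause and length norms and boosts are ignored; the function and variable names are made up for illustration.

```python
import math


def classic_like_score(clause_freqs: dict, query_clauses: list) -> float:
    """Toy score: coord * sum(sqrt(freq)) over matched clauses.

    Assumptions (not taken from the question): idf = 1 for every clause,
    no length normalization, no boosts.
    """
    matched = [c for c in query_clauses if clause_freqs.get(c, 0) > 0]
    # Dampened term frequency: sqrt(40) is only about 6.3, not 40.
    tf_sum = sum(math.sqrt(clause_freqs[c]) for c in matched)
    # Coordination factor: fraction of query clauses that matched.
    coord = len(matched) / len(query_clauses)
    return coord * tf_sum


clauses = ["red", "jacket", "red jacket"]
doc_a = {"jacket": 40}                             # one term, many times
doc_b = {"red": 1, "jacket": 1, "red jacket": 1}   # every clause, once

score_a = classic_like_score(doc_a, clauses)
score_b = classic_like_score(doc_b, clauses)
print(f"A: {score_a:.2f}, B: {score_b:.2f}")
```

Under these assumptions Document B still comes out ahead, because the square root flattens the 40 occurrences and the coordination factor cuts Document A's score to a third. Real scores also depend on idf and field length, and newer Solr versions default to BM25, so treat the numbers as illustrative only.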

Tf-idf is calculated as the product of two factors: tf x idf. tf (term frequency) is how frequently a word occurs in a text. idf (inverse document frequency) measures how rare a word is across all the documents in the search engine.

Consider a text containing 100 words in which the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03. Now assume we have 10 million documents and the word cat appears in one thousand of them. The inverse document frequency (idf) is then calculated as log10(10,000,000 / 1,000) = 4, using a base-10 logarithm. The tf-idf weight is the product of these two quantities: 0.03 * 4 = 0.12. See original source of example.
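
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the textbook formula used in the example (raw frequency divided by document length, base-10 logarithm); the function names are made up here, and Lucene's own similarity classes use different variants of tf and idf.

```python
import math


def term_frequency(term_count: int, total_terms: int) -> float:
    """Share of the document's words that are the given term."""
    return term_count / total_terms


def inverse_document_frequency(total_docs: int, docs_with_term: int) -> float:
    """Base-10 log of (all documents / documents containing the term)."""
    return math.log10(total_docs / docs_with_term)


# Numbers from the "cat" example above.
tf = term_frequency(3, 100)
idf = inverse_document_frequency(10_000_000, 1_000)
weight = tf * idf

print(f"tf={tf:.2f} idf={idf:.2f} tf-idf={weight:.2f}")
```

Running it reproduces the values from the example: tf 0.03, idf 4 and a tf-idf weight of 0.12.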

The best way to ignore tf-idf is probably Solr's exists function, which can be used through the bf (boost function) parameter of the DisMax and eDisMax query parsers. For example:

bf = if(exists(query(location:A)),5,if(exists(query(location:B)),3,0))

See original source and context of second example.
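
As a sketch of how such a bf expression might be sent to Solr, the Python snippet below only builds the request URL and does not contact a server. The host, port and collection name are hypothetical, the use of defType=edismax is an assumption (bf is a DisMax/eDisMax parameter), and the bf string is copied verbatim from the example above, including its location field, which comes from that example's context rather than from the question.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical Solr endpoint; replace host and collection with your own.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycollection/select"


def build_boosted_query_url(user_query: str, boost_function: str) -> str:
    """Return a select URL with the boost function URL-encoded as bf."""
    params = {
        "defType": "edismax",   # assumption: bf needs DisMax or eDisMax
        "q": user_query,
        "bf": boost_function,
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


# bf expression copied unchanged from the answer above.
bf = "if(exists(query(location:A)),5,if(exists(query(location:B)),3,0))"
url = build_boosted_query_url('text:("red" "jacket" "red jacket")', bf)
print(url)
```

Encoding the parameters with urlencode avoids problems with the parentheses, quotes and colons in both the query and the boost function. Note that bf adds a constant to the score of matching documents; the tf-idf part of the score is still computed for the main query.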
