Give less weight to term frequency in Solr?
Question
How do I change the scoring function of Solr to give less weight to term frequency?
I am using a PageRank-like document boost as a relevancy factor. My search index currently ranks many documents at the top that are "spammy" or poorly cleaned up and contain repetitive words.
I know the score is calculated from term frequency (how often a search term occurs in the document), inverse document frequency, and other factors (see "How are documents scored?"). I could just increase the boost, but that would de-emphasize the other factors, too.
Is the way to go to specify a function at query time (and if so, what is the default function), or do I have to change the configuration and reindex? I am using django-haystack with Solr, if that makes a difference.
Recommended answer
I'm not sure this is the best way to do it, but it seems to work. I created a subclass of Similarity in Java. In ClassicSimilarity, term frequency is defined as sqrt(freq). Adding a multiplicative factor doesn't make sense, since tf is multiplied with the other terms, not added to them, so the scale factor would just be applied uniformly to every score. That is, scale * a * b doesn't make sense, whereas scale * a + b would. What you can do in this case is a^scale * b. This effectively applies the scale factor in the logarithm: log(score) = scale * log(a) + log(b).
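To make the argument above concrete, here is a minimal standalone sketch. It does not use Lucene: the score is simplified to scale * freq^exponent * boost, and the two documents (a repetitive page with a low boost and a clean page with a higher boost) are hypothetical values chosen only for illustration.

```java
public class TfExponentDemo {
    // Simplified stand-in for the real scoring formula (an assumption for
    // illustration): score = scale * freq^exponent * boost.
    public static double score(double scale, double freq, double exponent, double boost) {
        return scale * Math.pow(freq, exponent) * boost;
    }

    public static void main(String[] args) {
        // Hypothetical documents: the spammy one repeats the term 100 times.
        double spamFreq = 100, spamBoost = 1.0;
        double cleanFreq = 4, cleanBoost = 4.0;

        // Each row is {scale, exponent}: classic tf, classic tf with a
        // uniform scale factor, and tf with a smaller exponent.
        double[][] settings = { {1.0, 0.5}, {0.5, 0.5}, {1.0, 0.25} };
        for (double[] s : settings) {
            double spam = score(s[0], spamFreq, s[1], spamBoost);
            double clean = score(s[0], cleanFreq, s[1], cleanBoost);
            System.out.printf("scale=%.2f exponent=%.2f spam=%.2f clean=%.2f top=%s%n",
                    s[0], s[1], spam, clean, spam > clean ? "spam" : "clean");
        }
    }
}
```

With these made-up numbers, halving the scale leaves the spammy document on top, while lowering the exponent from 0.5 to 0.25 lets the boost win and puts the clean document first.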
Also note that the default similarity in Solr 6 and later is not TF-IDF after all, but BM25. The class below is a variation of TF-IDF.
package com.example.solr;

import org.apache.lucene.search.similarities.ClassicSimilarity;

// ClassicSimilarity (TF-IDF) with a flatter term-frequency curve.
public class CustomSimilarity extends ClassicSimilarity {
    @Override
    public float tf(float freq) {
        // A smaller exponent dampens repeated terms more strongly.
        return (float) Math.pow(freq, 0.25); // default: 0.5
    }

    @Override
    public String toString() {
        return "CustomSimilarity";
    }
}
Compile it with:
javac -cp /path/to/solr-6.6.1/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.6.1.jar:. -d . CustomSimilarity.java
jar -cvf myscorer.jar com
Then add the following to solrconfig.xml:
<lib path="/path/to/myscorer.jar" />
and the following to schema.xml:
<similarity class="com.example.solr.CustomSimilarity">
</similarity>
After restarting Solr, you can verify that the new similarity class is being used under http://localhost:8983/solr/#/<corename>/schema.