Give less weight to term frequency in Solr?


Question

How do I change the scoring function of Solr to give less weight to "term frequency"?

I am using a PageRank-like document boost as a relevancy factor. My search index currently ranks many documents at the top that are "spammy" or poorly cleaned up and contain repetitive words.

I know the score is calculated from term frequency (how often a search term occurs in the document), inverse document frequency, and other factors (see "How are documents scored?"). I could just increase the boost, but that would de-emphasize the other factors, too.

Is the way to go to specify a function at query time (and if so, what is the default function), or do I have to change the configuration and reindex? I am using django-haystack with Solr, if that makes a difference.

Recommended answer

I'm not sure this is the best way to do it, but it seems to work. I created a subclass of Similarity in Java. In ClassicSimilarity, term frequency is defined as sqrt(freq). Multiplying tf by a constant factor makes no sense, since tf is multiplied with the other terms rather than added to them: the factor would be applied uniformly to every score and leave the ranking unchanged. That is, scale * a * b doesn't make sense, whereas scale * a + b would. What you can do in this case is a^scale * b, which applies the scale factor in the logarithm: log(score) = scale * log(a) + log(b).
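To make the effect of the exponent concrete, here is a minimal standalone sketch. It does not use Lucene at all: the score is simplified to boost * tf * idf, and the class name, the two documents, and their boost and frequency values are hypothetical numbers chosen only for illustration.

```java
// Standalone illustration; real Lucene scoring involves more factors
// (norms, query weights), which are deliberately left out here.
class TfDampingDemo {
    // tf(freq) = freq^exponent; ClassicSimilarity corresponds to exponent 0.5.
    public static double tf(double freq, double exponent) {
        return Math.pow(freq, exponent);
    }

    // Simplified (assumed) score: document boost * tf * idf.
    public static double score(double boost, double freq, double idf, double exponent) {
        return boost * tf(freq, exponent) * idf;
    }

    public static void main(String[] args) {
        double idf = 1.0; // same term for both documents, so idf cancels out

        // Hypothetical documents: a spammy page that repeats the term 100 times
        // with a low boost, and a clean page with 4 occurrences and a higher boost.
        double spamBoost = 1.0, spamFreq = 100;
        double cleanBoost = 3.0, cleanFreq = 4;

        for (double exponent : new double[] {0.5, 0.25}) {
            double spam = score(spamBoost, spamFreq, idf, exponent);
            double clean = score(cleanBoost, cleanFreq, idf, exponent);
            System.out.printf("exponent=%.2f spam=%.3f clean=%.3f%n", exponent, spam, clean);
        }

        // A constant factor on tf multiplies both scores alike, so the ratio
        // between them (and therefore the ranking) stays the same.
        double scale = 0.1;
        double ratio = score(spamBoost, spamFreq, idf, 0.5) / score(cleanBoost, cleanFreq, idf, 0.5);
        double scaledRatio = (scale * score(spamBoost, spamFreq, idf, 0.5))
                / (scale * score(cleanBoost, cleanFreq, idf, 0.5));
        System.out.printf("ratio=%.3f scaledRatio=%.3f%n", ratio, scaledRatio);
    }
}
```

With these made-up numbers, the spammy page outranks the clean one at exponent 0.5 (10 versus 6) and falls behind it at exponent 0.25 (about 3.16 versus 4.24), while the constant factor leaves the ratio between the two scores untouched.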

Also note that the default similarity function doesn't seem to be TF-IDF after all, but BM25. The CustomSimilarity class below is a variation of TF-IDF.

package com.example.solr;
import org.apache.lucene.search.similarities.ClassicSimilarity;

public class CustomSimilarity extends ClassicSimilarity {
    @Override
    public float tf(float freq) {
        return (float) Math.pow(freq, 0.25); // default: 0.5
    }

    @Override
    public String toString() {
        return "CustomSimilarity";
    }
}

Compile it with:

javac -cp /path/to/solr-6.6.1/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.6.1.jar:. -d . CustomSimilarity.java
jar -cvf myscorer.jar com

Then add this to solrconfig.xml:

<lib path="/path/to/myscorer.jar" />

And this to schema.xml:

<similarity class="com.example.solr.CustomSimilarity">
</similarity>

After restarting Solr, you can verify that the new similarity class is being used under http://localhost:8983/solr/#/<corename>/schema.
