Does PostgreSQL use tf-idf?


Question

I would like to know whether full text search in PostgreSQL 9.3 with a GIN/GiST index uses tf-idf (term frequency-inverse document frequency).

In particular, my columns contain phrases in which some words are common, whereas others are quite unique (e.g., names). I want to index these columns so that matches on the unique words are weighted higher than matches on common words.

Answer

No. The ts_rank function has no native way to rank results by global (corpus) term frequency. The ranking algorithm does, however, take frequency within the document into account:

http://www.postgresql.org/docs/9.3/static/textsearch-controls.html

So if I search for "dog|chihuahua", the following two documents get the same rank, even though "chihuahua" is the rarer word:

"I want a dog"
"I want a chihuahua"

However, the following document would rank higher than the two above, because it contains the stemmed token "dog" twice:

"dog lovers have an average of 1.5 dogs"

In short: a higher term frequency within the document results in a higher rank, but how rare a term is across the corpus has no impact.
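
To make the distinction concrete, here is a toy sketch in Python that scores the three example documents twice: once by raw term frequency and once by tf-idf. It is not PostgreSQL's ts_rank formula. The function names, the tiny stop-word list, and the crude plural stemming are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumed, heavily simplified stop-word list (not PostgreSQL's real one).
STOP_WORDS = {"i", "a", "an", "of", "have", "the", "for"}


def tokenize(text):
    """Lowercase, keep alphabetic words, drop stop words, strip a plural 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    # Crude stand-in for stemming: "dogs" -> "dog", "lovers" -> "lover".
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in kept]


def tf_score(doc_tokens, query_terms):
    """Score by in-document frequency only (the behaviour described above)."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)


def tfidf_score(doc_tokens, query_terms, corpus):
    """Score by term frequency weighted by inverse document frequency."""
    counts = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for doc in corpus if term in doc)  # documents containing term
        if df:
            score += counts[term] * math.log(len(corpus) / df)
    return score


documents = [
    "I want a dog",
    "I want a chihuahua",
    "dog lovers have an average of 1.5 dogs",
]
corpus = [tokenize(d) for d in documents]
query = ["dog", "chihuahua"]  # corresponds to the query dog|chihuahua

tf_scores = [tf_score(doc, query) for doc in corpus]
tfidf_scores = [tfidf_score(doc, query, corpus) for doc in corpus]

for text, tf, tfidf in zip(documents, tf_scores, tfidf_scores):
    print(f"{text!r}: tf={tf}, tf-idf={tfidf:.3f}")
```

With frequency-only scoring, the first two documents tie and the third ranks highest. With the idf weighting, the "chihuahua" document moves ahead because the term appears in fewer documents of this small corpus.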

One caveat: text search ignores stop words, so you will not match on ultra-high-frequency words like "the", "a", "of", "for", etc. (assuming you have set your language correctly).
