Full-text search relevance is measured in?

Question

I am making a quiz system, and when quizmakers insert questions into the Question Bank, I am to check the DB for duplicate / very highly similar questions.

Testing MySQL's MATCH() ... AGAINST(), the highest relevance I get is 30+, even when I test against a 100% identical string.

So what exactly is the relevance? To quote the manual:

Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.

My problem is how to use the relevance value to test whether a string is a duplicate. If it's a 100% duplicate, prevent it from being inserted into the Question Bank. But if it is only similar, prompt the quizmaker to verify whether to insert it or not. So how do I do that? 30+ for a 100% identical string is not a percentage, so I'm stumped.

Thanks in advance.

Solution

andygeers is on the right track: Those numbers have no empirical meaning other than their relations to each other and cannot be used on their own to determine what is or is not an "exact match". You need to determine that yourself. Even aside from the limitations of fulltext search ranking, there's also the open question of just what you consider to constitute an "exact match". (Actual text only or do soundex matches count? Do synonyms (e.g., "couch" vs. "sofa") count as matching or as distinct? Should an attempt be made to compensate for misspellings? Etc.)

If I had the need to perform such a check, I would grab only the highest-ranked entry returned by the fulltext search, remove any designated stopwords, normalize whitespace, convert to lowercase, do the comparison, and leave it at that until I encountered a case that called for it to be refined further. It's not really all that much extra work - if you specify the language you're using for your application, you could probably find someone around here who could write the normalization function within a dozen or so lines of code.
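
As a rough illustration of that approach, here is a minimal Python sketch. It assumes the application has already fetched the text of the highest-ranked row from the fulltext search. The names `STOPWORDS`, `normalize` and `is_exact_duplicate` are hypothetical, the stopword set is a placeholder rather than MySQL's actual list, and stripping punctuation is an extra assumption beyond the steps listed above.

```python
import re

# Placeholder stopword set (an assumption): in practice this should mirror
# the stopwords that the MySQL fulltext index is configured to ignore.
STOPWORDS = {"a", "an", "the", "is", "of", "to", "in"}


def normalize(text: str) -> str:
    """Lowercase the text, drop stopwords, and collapse whitespace.

    Punctuation is also discarded here (an extra assumption), because only
    runs of letters, digits and apostrophes are kept as words.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(w for w in words if w not in STOPWORDS)


def is_exact_duplicate(new_question: str, top_match: str) -> bool:
    """Compare the new question with the top-ranked fulltext hit."""
    return normalize(new_question) == normalize(top_match)


if __name__ == "__main__":
    candidate = "What is  the Capital of France?"
    top_hit = "what is the capital of france"
    # True means the insert would be blocked; anything else that the
    # fulltext search returned could be shown to the quizmaker for review.
    print(is_exact_duplicate(candidate, top_hit))
```

With this split, the relevance score is only used to rank candidates, while the "exact match" decision rests on a comparison you define yourself and can refine later (for example, to handle synonyms or misspellings).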
