Algorithm to find articles with similar text


Question

I have many articles in a database (each with a title and text), and I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions" list that appears when you ask a question.

I tried googling for this but only found pages about other "similar text" problems, such as comparing every article with all the others and storing the similarity somewhere. Stack Overflow does this in "real time" on text that I have just typed.

How?

Recommended answer

Edit distance isn't a likely candidate, as it would be spelling/word-order dependent, and much more computationally expensive than Will is leading you to believe, considering the size and number of the documents you'd actually be interested in searching.
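To illustrate the word-order point, here is a minimal sketch of character-level Levenshtein edit distance. The function name `edit_distance` and the sample strings are made up for illustration. Each comparison costs time proportional to the product of the two text lengths, which is what makes it expensive across a large collection of long articles.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


# Same three words, different order: the distance is large even though
# the two texts are about exactly the same thing.
reordered = edit_distance("find similar articles", "similar articles find")

# A single typo: the distance is small.
typo = edit_distance("find similar articles", "find similar articlez")
```

Here `reordered` comes out far larger than `typo`, even though the reordered text uses identical words.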

Something like Lucene is the way to go. You index all your documents, and then when you want to find documents similar to a given document, you turn your given document into a query, and search the index. Internally Lucene will be using tf-idf and an inverted index to make the whole process take an amount of time proportional to the number of documents that could possibly match, not the total number of documents in the collection.
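The idea can be sketched in a few lines of plain Python. This is not Lucene's API, and Lucene's actual scoring is more involved; the class `TinyIndex`, its methods, the smoothed idf formula, and the sample documents are all assumptions made for illustration. The point it shows is that the query only touches the posting lists of its own terms, so documents sharing no term with the query are never visited.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical toy inverted index with a simplified tf-idf score."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term count}
        self.doc_count = 0

    def add(self, doc_id, text):
        """Index one document under the given id."""
        for term, count in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = count
        self.doc_count += 1

    def idf(self, term):
        """Smoothed inverse document frequency: rare terms weigh more."""
        df = len(self.postings.get(term, ()))
        return math.log((1 + self.doc_count) / (1 + df)) + 1.0

    def similar(self, text, top_n=3, exclude=None):
        """Treat the text as a query and return the best-matching doc ids."""
        scores = defaultdict(float)
        for term, q_count in Counter(tokenize(text)).items():
            weight = self.idf(term) ** 2
            # Only documents containing this term are visited.
            for doc_id, d_count in self.postings.get(term, {}).items():
                if doc_id != exclude:
                    scores[doc_id] += q_count * d_count * weight
        return sorted(scores, key=scores.get, reverse=True)[:top_n]


index = TinyIndex()
index.add("a", "How to sort a list in Python")
index.add("b", "Sorting a Python list of dictionaries")
index.add("c", "Baking sourdough bread at home")

related = index.similar("python list sort")  # -> ['a', 'b']
```

The sketch skips document-length normalization, stemming, and stop-word removal, all of which a real search library handles for you. To find articles related to one already in the index, pass its text as the query and its id as `exclude`.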
