What tried and true algorithms for suggesting related articles are out there?

Question

Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related.

Let's assume very little metadata about each item. That is, no tags or categories. Treat each as one big blob of text, including the title and author name.

How do you go about finding the possibly related documents?

I'm rather interested in the actual algorithm, not ready solutions, although I'd be ok with taking a look at something implemented in ruby or python, or relying on mysql or pgsql.

edit: the current answer is pretty good but I'd like to see more. Maybe some really bare example code for a thing or two.

Answer

This is a pretty big topic -- in addition to the answers people come up with here, I recommend tracking down the syllabi for a couple of information retrieval classes and checking out the textbooks and papers assigned for them. That said, here's a brief overview from my own grad-school days:

The simplest approach is called a bag of words. Each document is reduced to a sparse vector of {word: wordcount} pairs, and you can throw a NaiveBayes (or some other) classifier at the set of vectors that represents your set of documents, or compute similarity scores between each bag and every other bag (this is called k-nearest-neighbour classification). KNN is fast for lookup, but requires O(n^2) storage for the score matrix; however, for a blog, n isn't very large. For something the size of a large newspaper, KNN rapidly becomes impractical, so an on-the-fly classification algorithm is sometimes better. In that case, you might consider a ranking support vector machine. SVMs are neat because they don't constrain you to linear similarity measures, and are still quite fast.
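Since the question asked for bare example code, here is a minimal Python sketch of the bag-of-words approach with cosine similarity and a brute-force nearest-neighbour lookup. The function names (bag_of_words, cosine_similarity, most_similar) and the tokenization are illustrative choices, not part of the original answer:

    import math
    import re
    from collections import Counter

    def bag_of_words(text):
        # Lowercase and split on letters to get a sparse {word: count} vector.
        words = re.findall(r"[a-z']+", text.lower())
        return Counter(words)

    def cosine_similarity(bag_a, bag_b):
        # Dot product over the words the two bags share, divided by the vector norms.
        shared = set(bag_a) & set(bag_b)
        dot = sum(bag_a[w] * bag_b[w] for w in shared)
        norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
        norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    def most_similar(target_text, other_texts, k=5):
        # Brute-force nearest-neighbour lookup: score the target against every other document.
        target = bag_of_words(target_text)
        scored = [(cosine_similarity(target, bag_of_words(t)), t) for t in other_texts]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

Calling most_similar(article_text, all_other_articles) returns the k best-scoring candidates; for a blog-sized corpus the pairwise comparison mentioned above is perfectly workable.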

Stemming is a common preprocessing step for bag-of-words techniques; this involves reducing morphologically related words, such as "cat" and "cats", "Bob" and "Bob's", or "similar" and "similarly", down to their roots before computing the bag of words. There are a bunch of different stemming algorithms out there; the Wikipedia page has links to several implementations.
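As a rough illustration, the bag-of-words code above can be run through a stemmer first. This sketch assumes NLTK's Porter stemmer is available; the answer itself doesn't prescribe a particular stemming library:

    import re
    from collections import Counter
    from nltk.stem.porter import PorterStemmer  # assumes NLTK is installed

    stemmer = PorterStemmer()

    def stemmed_bag_of_words(text):
        # Reduce each token to its stem before counting, so inflected forms
        # ("cat"/"cats", "run"/"running") collapse into a single term.
        words = re.findall(r"[a-z']+", text.lower())
        return Counter(stemmer.stem(w) for w in words)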

If bag-of-words similarity isn't good enough, you can abstract it up a layer to bag-of-N-grams similarity, where you create the vector that represents a document based on pairs or triples of words. (You can use 4-tuples or even larger tuples, but in practice this doesn't help much.) This has the disadvantage of producing much larger vectors, and classification will accordingly take more work, but the matches you get will be much closer syntactically. OTOH, you probably don't need this for semantic similarity; it's better for stuff like plagiarism detection. Chunking, or reducing a document down to lightweight parse trees, can also be used (there are classification algorithms for trees), but this is more useful for things like the authorship problem ("given a document of unknown origin, who wrote it?").
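Switching to bag-of-N-grams only changes the keys of the bag, so the same similarity scoring applies unchanged. A small sketch, where the n=2 default and the space-joined keys are arbitrary choices:

    import re
    from collections import Counter

    def bag_of_ngrams(text, n=2):
        # Key the bag by consecutive word pairs (n=2) or triples (n=3) instead of
        # single words; the cosine similarity above works on these bags as-is.
        words = re.findall(r"[a-z']+", text.lower())
        grams = zip(*(words[i:] for i in range(n)))
        return Counter(" ".join(g) for g in grams)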

Perhaps more useful for your use case is concept mining, which involves mapping words to concepts (using a thesaurus such as WordNet), then classifying documents based on similarity between concepts used. This often ends up being more efficient than word-based similarity classification, since the mapping from words to concepts is reductive, but the preprocessing step can be rather time-consuming.
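A very crude sketch of the word-to-concept mapping step, assuming NLTK and its WordNet corpus are installed and, simplistically, taking each word's first synset as its concept (a real system would do proper word-sense disambiguation):

    import re
    from collections import Counter
    from nltk.corpus import wordnet  # assumes NLTK and the WordNet corpus are installed

    def bag_of_concepts(text):
        # Map each word to the name of its first WordNet synset, so synonyms like
        # "car" and "automobile" land in the same bucket. Words WordNet doesn't
        # know are kept as plain words.
        words = re.findall(r"[a-z']+", text.lower())
        concepts = []
        for w in words:
            synsets = wordnet.synsets(w)
            concepts.append(synsets[0].name() if synsets else w)
        return Counter(concepts)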

Finally, there's discourse parsing, which involves parsing documents for their semantic structure; you can run similarity classifiers on discourse trees the same way you can on chunked documents.

These pretty much all involve generating metadata from unstructured text; doing direct comparisons between raw blocks of text is intractable, so people preprocess documents into metadata first.
