What tried and true algorithms for suggesting related articles are out there?

Problem Description

Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related.

Let's assume very little metadata about each item. That is, no tags or categories; treat each one as a single big blob of text, including the title and author name.

How do you go about finding the possibly related documents?

I'm rather interested in the actual algorithm, not ready-made solutions, although I'd be OK with taking a look at something implemented in ruby or python, or relying on mysql or pgsql.

Edit: the current answer is pretty good, but I'd like to see more. Maybe some really bare example code for a thing or two.

Answer

This is a pretty big topic -- in addition to the answers people come up with here, I recommend tracking down the syllabi for a couple of information retrieval classes and checking out the textbooks and papers assigned for them. That said, here's a brief overview from my own grad-school days:

The simplest approach is called a bag of words. Each document is reduced to a sparse vector of {word: wordcount} pairs, and you can throw a NaiveBayes (or some other) classifier at the set of vectors that represents your set of documents, or compute similarity scores between each bag and every other bag (this is called k-nearest-neighbour classification). KNN is fast for lookup, but requires O(n^2) storage for the score matrix; however, for a blog, n isn't very large. For something the size of a large newspaper, KNN rapidly becomes impractical, so an on-the-fly classification algorithm is sometimes better. In that case, you might consider a ranking support vector machine. SVMs are neat because they don't constrain you to linear similarity measures, and are still quite fast.
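Since the question asks for some really bare example code, here is a minimal bag-of-words sketch in Python. All of the names are illustrative rather than taken from any particular library, and the nearest-neighbour lookup is a brute-force scan instead of a precomputed score matrix:

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Reduce a document to a sparse {word: count} vector."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def related_articles(target_id, documents, k=5):
    """k-nearest-neighbour lookup: the k documents most similar to documents[target_id]."""
    target = bag_of_words(documents[target_id])
    scores = sorted(
        ((cosine_similarity(target, bag_of_words(text)), doc_id)
         for doc_id, text in documents.items() if doc_id != target_id),
        reverse=True,
    )
    return [doc_id for _, doc_id in scores[:k]]
```

For a blog-sized corpus you would compute the bags once, cache them, and rerun `related_articles` when an article is published; the O(n^2) score matrix mentioned above is just this loop precomputed for every pair of documents.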

Stemming is a common preprocessing step for bag-of-words techniques; this involves reducing morphologically related words, such as "cat" and "cats", "Bob" and "Bob's", or "similar" and "similarly", down to their roots before computing the bag of words. There are a bunch of different stemming algorithms out there; the Wikipedia page has links to several implementations.
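As one concrete example, NLTK's Porter stemmer (one of the algorithms the Wikipedia page links to) can be dropped straight into the bag-of-words sketch above; this assumes `nltk` is installed:

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer  # assumes `pip install nltk`

stemmer = PorterStemmer()

def stemmed_bag_of_words(text):
    """Bag of words with each token reduced to its stem before counting."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(stemmer.stem(w) for w in words)

# "cat" and "cats" now land in the same bucket ("cat"),
# as do "connect", "connected" and "connection" ("connect").
```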

If bag-of-words similarity isn't good enough, you can abstract it up a layer to bag-of-N-grams similarity, where you create the vector that represents a document based on pairs or triples of words. (You can use 4-tuples or even larger tuples, but in practice this doesn't help much.) This has the disadvantage of producing much larger vectors, and classification will accordingly take more work, but the matches you get will be much closer syntactically. OTOH, you probably don't need this for semantic similarity; it's better for stuff like plagiarism detection. Chunking, or reducing a document down to lightweight parse trees, can also be used (there are classification algorithms for trees), but this is more useful for things like the authorship problem ("given a document of unknown origin, who wrote it?").
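The change from words to N-grams is small. Here is a sketch for word pairs (the helper name is mine); the cosine-similarity function from the earlier sketch handles these vectors unchanged, since the keys are simply tuples instead of strings:

```python
import re
from collections import Counter

def bag_of_ngrams(text, n=2):
    """Bag of word n-grams: counts of every run of n consecutive words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(*(words[i:] for i in range(n))))

# bag_of_ngrams("the cat sat on the cat", 2) counts ("the", "cat") twice,
# ("cat", "sat") once, and so on.
```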

Perhaps more useful for your use case is concept mining, which involves mapping words to concepts (using a thesaurus such as WordNet), then classifying documents based on similarity between concepts used. This often ends up being more efficient than word-based similarity classification, since the mapping from words to concepts is reductive, but the preprocessing step can be rather time-consuming.
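A rough sketch of the word-to-concept step using NLTK's WordNet interface; picking the first synset for each word is a deliberately naive disambiguation heuristic, and the function name is mine:

```python
import re
from collections import Counter
from nltk.corpus import wordnet  # assumes nltk plus nltk.download('wordnet')

def bag_of_concepts(text):
    """Map each word to its most common WordNet synset and count the concepts."""
    words = re.findall(r"[a-z']+", text.lower())
    concepts = []
    for word in words:
        synsets = wordnet.synsets(word)
        # Keep words WordNet doesn't know; otherwise take the first (most common) sense.
        concepts.append(synsets[0].name() if synsets else word)
    return Counter(concepts)
```

The payoff is that synonyms ("car" and "automobile", say) collapse onto the same concept key, so the similarity scores can pick up documents that use different vocabulary for the same ideas.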

Finally, there's discourse parsing, which involves parsing documents for their semantic structure; you can run similarity classifiers on discourse trees the same way you can on chunked documents.

These pretty much all involve generating metadata from unstructured text; doing direct comparisons between raw blocks of text is intractable, so people preprocess documents into metadata first.
