Is there an algorithm for determining the relevance of a text to a theme?
Question
I want to know what can be used to determine how relevant a page is to a theme such as games or movies.
Is there research in this area, or is the only option to count how many times certain relevant words appear?
Recommended answer
The common choice is supervised document classification on bag-of-words (or bag-of-n-grams) features, preferably with tf-idf weighting.
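To make the feature step concrete, here is a minimal sketch of bag-of-words features with tf-idf weighting in plain Python. The tokenizer, the function names, the sample documents, and the exact idf formula (`log(N / df) + 1`, one of several common variants) are all assumptions chosen for illustration; a real system would more likely use a library vectorizer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    tf  = raw count of the term in the document
    idf = log(N / df) + 1, where df is the number of documents
          that contain the term (an assumed, commonly used variant)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in counts.items()}
        )
    return vectors


# Hypothetical sample documents, made up for this sketch.
docs = [
    "the game has great graphics and the gameplay is fun",
    "the movie has a great cast and the plot is fun",
    "the console game ships next month",
]
vectors = tfidf_vectors(docs)
```

With this weighting, a single occurrence of a term that appears in only one document (such as "gameplay") counts for more than a single occurrence of a term shared by several documents (such as "game").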
Popular algorithms include Naive Bayes and (linear) SVMs.
For this approach, you'll need labeled training data, i.e. documents annotated with their relevant themes.
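As a rough illustration of the supervised setup, the sketch below trains a small multinomial Naive Bayes classifier on a handful of labeled documents. It uses raw word counts with add-one smoothing rather than tf-idf weights, to keep it short. The class name, the tokenizer, and the training texts are hypothetical, and four documents are far too few for real use.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesThemeClassifier:
    """Multinomial Naive Bayes over bag-of-words counts (add-one smoothing)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # theme -> word -> count
        self.doc_counts = Counter(labels)        # theme -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        """Return {theme: log P(theme) + sum of log P(word | theme)}."""
        # Words never seen in training carry no information and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for theme, n_docs in self.doc_counts.items():
            counts = self.word_counts[theme]
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / self.total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / total)
            scores[theme] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical labeled training data: documents annotated with a theme.
train_texts = [
    "new console game with multiplayer levels",
    "the game studio released a patch for the shooter",
    "the film director cast a famous actor",
    "this movie has a great soundtrack and plot",
]
train_labels = ["games", "games", "movies", "movies"]

clf = NaiveBayesThemeClassifier().fit(train_texts, train_labels)
prediction = clf.predict("a multiplayer shooter game")
```

The per-theme values from `log_scores` can also serve as a relative relevance signal: the larger the gap between the best theme and the rest, the more strongly the page's vocabulary points to that theme.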
See, e.g., Introduction to Information Retrieval, chapters 13-15.