How to group / compare similar news articles


Question

In an app that I'm creating, I want to add functionality that groups news stories together: stories about the same topic from different sources should end up in the same group. For example, articles on XYZ from CNN and MSNBC would be in the same group. I am guessing it's some sort of fuzzy logic comparison. How would I go about doing this from a technical standpoint? What are my options? We haven't even started the app yet, so we aren't limited in the technologies we can use.

Thanks in advance for the help!

Recommended answer

From a machine learning standpoint, this problem breaks down into a few subproblems.

First, you need to decide which properties of the news stories to group on. A common technique is the "bag of words" representation: just a list of the words that appear in the body of the story or in the title. You can do some additional processing, such as removing common English "stop words" that carry little meaning, such as "the" and "because". You can even apply Porter stemming to remove redundancy from plurals and word endings such as "-ion". This list of words is the feature vector of each document and will be used to measure similarity. You may have to do some preprocessing to remove HTML markup.
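As a rough illustration of this first step, here is a minimal sketch in plain Python. The stop-word list is a tiny made-up sample, `strip_html` and `bag_of_words` are hypothetical helper names, and stemming is left out entirely; a real project would more likely lean on an NLP library for all three.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "in", "on",
    "to", "is", "are", "because", "for", "with",
}

def strip_html(text: str) -> str:
    """Crudely replace HTML tags with spaces; a real parser is safer."""
    return re.sub(r"<[^>]+>", " ", text)

def bag_of_words(text: str) -> Counter:
    """Lowercase, split into alphabetic tokens, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", strip_html(text).lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

story = "<p>The senate passed the budget because the senate had the votes.</p>"
print(bag_of_words(story))
```

In the resulting bag, "senate" is counted twice, while "the" and "because" are dropped.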

Second, you have to define a similarity metric, so that similar stories score high in similarity. Going along with the bag-of-words approach, two stories are similar if they contain similar words (I'm being vague here because there are tons of things you can try, and you'll have to see which works best).
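One common choice for bags of words is cosine similarity. It is only one of the many options alluded to above, not necessarily the best one for news. A minimal sketch, assuming each story is already a `Counter` of word counts (the three sample stories are invented):

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented example stories, already tokenized.
cnn = Counter("senate passes budget bill".split())
msnbc = Counter("budget bill passes senate vote".split())
sports = Counter("home team wins championship game".split())

print(cosine_similarity(cnn, msnbc))   # shares most words, so close to 1
print(cosine_similarity(cnn, sports))  # shares no words, so 0.0
```

Because the score depends on the angle rather than the length of the vectors, a long article and a short one on the same topic can still score high.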

Finally, you can use a classic clustering algorithm, such as k-means clustering, to group the stories together based on the similarity metric.
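For a feel of what k-means does, here is a bare-bones sketch. It assumes each story has already been turned into a fixed-length list of word counts and that you choose the number of groups `k` up front. Note that textbook k-means works with Euclidean distance to cluster centroids rather than an arbitrary similarity metric; the function names, the toy vectors, and the fixed iteration count are all illustrative.

```python
import random

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def kmeans(vectors, k, iterations=20, seed=0):
    """Return a cluster index for each vector, using plain k-means."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

# Toy word-count vectors: the first two stories share vocabulary, as do the last two.
vectors = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
print(kmeans(vectors, k=2))
```

On this toy input the first two stories land in one cluster and the last two in the other; which cluster gets index 0 depends on the random start.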

In summary: convert each news story into a feature vector -> define a similarity metric based on this feature vector -> unsupervised clustering.
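The first arrow hides one practical detail: clustering code usually expects every story as a vector of the same length, so word counts have to be laid out over a shared vocabulary. A minimal sketch of that conversion, with hypothetical helper names and invented sample bags:

```python
from collections import Counter

def build_vocabulary(bags):
    """Sorted list of every word that appears in at least one bag."""
    return sorted(set().union(*bags))

def to_vector(bag, vocabulary):
    """Word counts laid out in vocabulary order; words absent from the bag count as 0."""
    return [bag[word] for word in vocabulary]

# Invented bags of words for three stories.
bags = [
    Counter(["senate", "budget", "vote"]),
    Counter(["budget", "vote", "vote"]),
    Counter(["storm", "coast"]),
]

vocabulary = build_vocabulary(bags)
vectors = [to_vector(bag, vocabulary) for bag in bags]
print(vocabulary)
print(vectors)
```

Each row of `vectors` can then be handed to whatever similarity function or clustering routine you settle on.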

Check out Google Scholar; there have probably been papers on this specific topic in the recent literature. Much of what I just discussed is implemented in natural language processing and machine learning libraries for most major languages.
