Auto Categorization of Content


Question

I'm developing a script that extracts messages from the message archive of a particular meetup.com group of which I'm a member: http://www.meetup.com/opencoffee/messages/archive/

The idea is to add these dynamically to a WordPress site and let people search the messages, auto-tag them, and so on.

The issue I have is how best to auto-categorize these messages. I would welcome any thoughts on how best to go about this and on the most efficient way to program it.

Option 1

Find a source of tags by subject area (finance, technology, business, etc.) using the Delicious API, and find related tags by subject:

http://delicious.com/tag/finance

http://delicious.com/tag/technology

If a message contains these tags, the message is assigned to the respective category.

I believe this could work, but I'm not sure of the most efficient method of scanning a message for these tags.
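One way to approach the scanning step, as a minimal sketch rather than a definitive implementation: split each message into a set of lowercase words and intersect it with each category's tag set, so every category costs one set operation instead of a search per tag. The tag lists and the names `CATEGORY_TAGS`, `tokenize`, and `categorize` are hypothetical and only illustrate the idea; real tags would come from whichever source is chosen.

```python
import re
from collections import Counter

# Hypothetical tag lists for illustration only; real ones would come
# from the chosen tag source.
CATEGORY_TAGS = {
    "finance": {"investment", "funding", "loan", "equity"},
    "technology": {"software", "api", "hosting", "startup"},
}


def tokenize(text):
    """Lowercase the text and return the set of distinct words in it."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def categorize(message, category_tags=CATEGORY_TAGS):
    """Return the categories whose tags occur in the message, best match first."""
    words = tokenize(message)
    scores = Counter()
    for category, tags in category_tags.items():
        hits = len(words & tags)  # number of distinct tags found in the message
        if hits:
            scores[category] = hits
    return [category for category, _ in scores.most_common()]
```

Calling `categorize(message_text)` returns a list of category names, which is empty when no tag matches. This sketch only matches single-word tags exactly; multi-word tags or word variants (for example "investing" versus "investment") would need extra handling such as stemming.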

Option 2

Find sites that are representative of the categories I need, such as ft.com and The Economist for finance or TechCrunch for technology. Then determine which tags people use to tag these sites, and assume by default that those tags reflect how people relate to these sites and their content.

Option 3

Pass the message URL to http://semanticproxy.com/ (part of the Reuters Calais project) or use the OpenCalais API. I have tried this, but without much success: the messages vary in depth, and there is not always enough content to return a meaningful taxonomy.

Here is an example message that I parsed through the Calais API:

Original message

http://www.meetup.com/opencoffee/messages/6045615/

Calais results

http://www.mashinteractive.com/opencoffee/calais.php

Summary

So that's about it. I would welcome any thoughts on methodology, and tips on how best to approach the message scanning for options 1 and 2.

FYI, there are approximately 1,700 messages to date, and I'm guessing I may have 10 categories, each defined by 20 or 30 tags.
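At this scale (roughly 1,700 messages and a few hundred tags in total) the whole archive fits comfortably in memory, so a batch pass could look something like the sketch below. It builds a tag-to-category lookup once, then scores every message in a single pass over its words. The names `build_index` and `categorize_all` and the `min_hits` threshold are assumptions for illustration; the threshold is there so that short messages matching only one stray tag stay uncategorized.

```python
import re
from collections import Counter, defaultdict


def build_index(category_tags):
    """Map each lowercase tag to the set of categories that list it."""
    index = defaultdict(set)
    for category, tags in category_tags.items():
        for tag in tags:
            index[tag.lower()].add(category)
    return index


def categorize_all(messages, index, min_hits=2):
    """Return {message_id: best category, or None if it has fewer than min_hits tags}."""
    results = {}
    for message_id, text in messages.items():
        scores = Counter()
        # Each distinct word is looked up once in the tag index.
        for word in set(re.findall(r"[a-z0-9']+", text.lower())):
            for category in index.get(word, ()):
                scores[category] += 1
        best = scores.most_common(1)
        if best and best[0][1] >= min_hits:
            results[message_id] = best[0][0]
        else:
            results[message_id] = None
    return results
```

Here `messages` is assumed to be a dictionary of message ID to message text, however the archive ends up being extracted. Messages that come back as `None` could be reviewed by hand or used to refine the tag lists.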

If anyone would like to help develop a WordPress plugin or class to do this, I would be more than happy to have you on board. Bear in mind I'm not a programmer; I just tinker around the edges and pretend I am one.

Thanks in advance

Jonathan, CEO

人群

Recommended answer

You may want to check out Zemanta, which has tools and plugins (including one for WordPress) for auto-tagging content. Also have a look at Common Tag, a vocabulary for expressing tags on content using RDFa, a semantic web standard currently indexed by some search engines.
