How to auto-tag content, algorithms and suggestions needed


Problem description

I am working with some really large databases of newspaper articles. I have them in a MySQL database, and I can query them all.
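For concreteness, here is a minimal sketch of pulling the article text out of MySQL with the mysql-connector-python package; the table and column names ("articles", "body") are hypothetical placeholders for whatever the actual schema uses:

    # Minimal sketch: fetch all article bodies from MySQL.
    # The table/column names ("articles", "body") are hypothetical.
    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="user", password="secret", database="news"
    )
    cursor = conn.cursor()
    cursor.execute("SELECT body FROM articles")
    articles = [row[0] for row in cursor.fetchall()]
    conn.close()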

I am now searching for ways to help me tag these articles with somewhat descriptive tags.

All these articles are accessible from a URL that looks like this:

http://web.site/CATEGORY/this-is-the-title-slug

So at least I can use the category to figure out what type of content we are working with. However, I also want to tag based on the article text.
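A tiny sketch of extracting the category from such a URL, assuming every article URL follows the http://web.site/CATEGORY/slug pattern shown above:

    # Pull the category out of an article URL of the form
    # http://web.site/CATEGORY/this-is-the-title-slug
    from urllib.parse import urlparse

    def category_from_url(url):
        parts = urlparse(url).path.strip("/").split("/")
        return parts[0] if parts else None

    print(category_from_url("http://web.site/CATEGORY/this-is-the-title-slug"))
    # -> "CATEGORY"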

My initial approach was this:

  1. Get all articles
  2. Get all words, remove all punctuation, split by spaces, and count occurrences
  3. Analyze the counts, and filter out common non-descriptive words such as "they", "I", "this", "these", "their", and so on
  4. Once all the common words are filtered out, the only thing left is words that are worth tagging
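A rough sketch of what this naive approach looks like in Python, assuming the articles are already loaded as a list of strings; the stop-word list here is illustrative, not exhaustive:

    # Naive word-count tagging: strip punctuation, split on whitespace,
    # count occurrences, and drop a hand-made stop-word list.
    import re
    from collections import Counter

    STOP_WORDS = {"they", "i", "this", "these", "their", "the", "a",
                  "an", "and", "or", "of", "to", "in", "is", "was"}

    def naive_tags(articles, top_n=10):
        counts = Counter()
        for text in articles:
            words = re.sub(r"[^\w\s]", " ", text.lower()).split()
            counts.update(w for w in words if w not in STOP_WORDS)
        return [word for word, _ in counts.most_common(top_n)]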

But this turned out to be a rather manual task, and not a very pretty or helpful approach.

This approach also suffered from the problem of words or names that are split by spaces. For example, if 1,000 articles contain the name "John Doe" and 1,000 articles contain the name "John Hanson", I would only get the word "John" out of it, not the first and last name together.

Answer

Automatically tagging articles is really a research problem, and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits, like NLTK.

To get started, I would suggest implementing a proper Tokeniser (much better than splitting by whitespace), and then taking a look at Chunking and Stemming algorithms.
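As a small sketch of what that looks like with NLTK (assuming the punkt tokenizer models have been downloaded via nltk.download):

    # Tokenise with a real tokeniser, then stem the tokens.
    # Requires: nltk.download("punkt") beforehand.
    import nltk
    from nltk.stem import PorterStemmer

    text = "John Doe's articles were tagged automatically."

    tokens = nltk.word_tokenize(text)  # handles punctuation and contractions

    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]  # "tagged" -> "tag", etc.
    print(stems)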

You might also want to count frequencies for n-grams, i.e. sequences of words, instead of individual words. This would take care of "words split by a space". Toolkits like NLTK have built-in functions for this.
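For example, counting bigram frequencies with NLTK, so that a multi-word name like "John Doe" is counted as a unit rather than split into "John" and "Doe":

    # Count consecutive token pairs; nltk.ngrams(tokens, n) generalises
    # this to any n.
    import nltk
    from collections import Counter

    tokens = nltk.word_tokenize("John Doe met John Hanson. John Doe left.")
    bigram_counts = Counter(nltk.bigrams(tokens))
    print(bigram_counts.most_common(3))
    # ("John", "Doe") now surfaces as its own candidate tag.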

Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then test how the algorithm tags the remaining set of articles to see how well it works.
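A minimal sketch of such a random train/evaluation split; tag_article below is a hypothetical stand-in for whatever tagging function you end up with:

    # Shuffle with a fixed seed for reproducibility, then split.
    import random

    def split_articles(articles, train_fraction=0.8, seed=42):
        shuffled = articles[:]
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    # train, held_out = split_articles(all_articles)
    # Tune stop-word lists and n-gram thresholds on `train`, then
    # inspect tag_article(a) for each a in `held_out`.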
