如何自动标记所需的内容，算法和建议 [英] How to auto-tag content, algorithms and suggestions needed

查看：78 发布时间：2020/5/18 0:36:03 tags nlp tagging

本文介绍了如何自动标记所需的内容，算法和建议的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在使用一些非常大型的报纸文章数据库，将它们保存在MySQL数据库中，并且可以查询所有内容.

I am working with some really large databases of newspaper articles, I have them in a MySQL database, and I can query them all.

我现在正在寻找方法来帮助我用一些描述性标签来标记这些文章.

I am now searching for ways to help me tag these articles with somewhat descriptive tags.

所有这些文章都可以通过如下网址访问:

All these articles is accessible from a URL that looks like this:

http://web.site/CATEGORY/this-is-the-title-slug

因此，至少我可以使用类别来确定我们正在使用的内容类型.但是，我也想基于文章文本进行标记.

So at least I can use the category to figure what type of content that we are working with. However, I also want to tag based on the article-text.

我最初的方法是这样做:

My initial approach was doing this:

获取所有文章
获取所有单词，删除所有标点符号，按空格分隔，然后按出现次数进行计数
分析它们，并过滤掉常见的非描述性词，例如他们"，我"，此"，这些"，他们的"等.
过滤掉所有常用词后，剩下的就是具有标签价值的词.

但事实证明这是一项相当手动的任务，而不是一种非常漂亮或有用的方法.

But this turned out to be a rather manual task, and not a very pretty or helpful approach.

这也存在单词或名称被空格分隔的问题，例如，如果1.000篇文章包含名称"John Doe"，而1.000篇文章包含名称"John Hanson"，那么我只会得到该单词是约翰"，不是他的名字和姓氏.

This also suffered from the problem of words or names that are split by space, for example if 1.000 articles contains the name "John Doe", and 1.000 articles contains the name of "John Hanson", I would only get the word "John" out of it, not his first name, and last name.

如何自动标记所需的内容，算法和建议 [英] How to auto-tag content, algorithms and suggestions needed

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

如何自动标记所需的内容，算法和建议 [英] How to auto-tag content, algorithms and suggestions needed

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭